Code Generation & Program Synthesis
Capabilities
AI-driven code generation, program synthesis, automated debugging, and software engineering with LLMs.
Recent Papers
This paper investigates the impact of data imbalance on deep learning-based software vulnerability detection using nine open-source datasets and two state-of-the-art DL models. The study confirms that data imbalance significantly affects model performance and that existing imbalance solutions exhibit varying effectiveness across datasets and evaluation metrics. The authors find that focal loss improves precision, that mean false error and class-balanced loss improve recall, and that random over-sampling improves F1-measure, but no single solution excels across all metrics.
Empirically demonstrates the significant impact of data imbalance on deep learning models for software vulnerability detection and evaluates the effectiveness of existing imbalance solutions across multiple datasets and metrics.
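As a point of reference, here is a minimal sketch of focal loss, one of the imbalance remedies the study evaluates, in PyTorch for a binary vulnerability classifier. The alpha/gamma values and the toy data are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal focal-loss sketch for imbalanced binary classification (assumed setup).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-example binary cross-entropy (no reduction yet).
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the model's probability for the true class; bce = -log(p_t).
    p_t = torch.exp(-bce)
    # Down-weight easy examples by (1 - p_t)^gamma, re-weight classes by alpha.
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()

# Example: 8 mostly-negative samples, mimicking an imbalanced vulnerability dataset.
logits = torch.randn(8)
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 0], dtype=torch.float32)
print(focal_loss(logits, targets))
```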
This paper presents an empirical study of AI coding agent contributions in open-source Android and iOS mobile app development by analyzing 2,901 AI-authored pull requests (PRs) from 193 GitHub repositories. The study reveals that Android projects receive more AI-authored PRs and exhibit higher acceptance rates compared to iOS, with routine tasks showing higher acceptance rates than structural changes. The analysis also indicates an initial improvement followed by a decline in PR resolution time on Android, providing insights into the evolving impact of AI agents on OSS mobile projects.
Empirically characterizes the effects of AI coding agents on open-source Android and iOS mobile app projects by analyzing PR acceptance behaviors across platforms, agents, and task categories.
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
The paper introduces PhyNiKCE, a neurosymbolic agentic framework that addresses the limitations of LLMs in autonomous CFD by decoupling neural planning from symbolic validation. PhyNiKCE uses a Symbolic Knowledge Engine to enforce physical constraints via a Deterministic RAG Engine, treating simulation setup as a Constraint Satisfaction Problem. Experiments using OpenFOAM and Gemini-2.5-Pro/Flash demonstrate a 96% improvement over baselines, a 59% reduction in self-correction loops, and a 17% decrease in LLM token consumption.
Introduces PhyNiKCE, a neurosymbolic framework that integrates neural planning with symbolic constraint enforcement to improve the reliability and efficiency of autonomous CFD agents.
This paper investigates GPT-5's ability to learn Idris, a functional programming language, through iterative prompting strategies. The authors found that zero-shot performance on Idris programming exercises was significantly lower than performance on Python and Erlang. By incorporating local compilation errors into the prompts, the authors achieved a substantial performance increase, solving 54 out of 56 problems.
Demonstrates that compiler-guided, error-driven iterative prompting significantly improves GPT-5's performance in a low-resource programming language.
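A minimal sketch of this error-driven loop: generate code, type-check it locally, and feed the compiler diagnostics back into the next prompt. The `ask_llm` hook and the `idris2 --check` invocation are assumptions standing in for the paper's actual harness.

```python
# Compiler-feedback prompting loop (sketch; ask_llm and the idris2 flags are assumed).
import subprocess, tempfile, pathlib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def solve_with_compiler_feedback(task: str, max_rounds: int = 5):
    prompt = f"Write an Idris 2 solution for:\n{task}"
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        src = pathlib.Path(tempfile.mkdtemp()) / "Main.idr"
        src.write_text(code)
        result = subprocess.run(
            ["idris2", "--check", str(src)], capture_output=True, text=True
        )
        if result.returncode == 0:
            return code  # type-checks; hand off to functional tests
        # Append the compiler diagnostics so the next attempt can repair them.
        errors = result.stdout + result.stderr
        prompt += f"\n\nYour code failed to compile:\n{errors}\nFix it."
    return None
```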
The paper introduces MING, an MLIR-based framework for automating the HLS design process of CNNs targeting resource-constrained edge FPGAs. MING employs a streaming architecture with optimized buffer management to address the limitations of existing frameworks in handling stringent resource constraints. Experiments demonstrate that MING achieves significant speedups (15x for multi-layer CNN kernels and up to 200x for single-layer kernels) and can generate efficient designs for larger input sizes where other frameworks fail.
Introduces an MLIR-based framework, MING, that automates HLS design for CNNs on resource-constrained edge FPGAs using a streaming architecture with optimized buffer management.
The paper introduces Execute-Summarize (ES), a framework that decouples task execution from workflow construction in LLMs, addressing the challenge of accurately translating LLM reasoning into structured workflows. ES first completes the task using available tools and then independently reconstructs a structured workflow from execution traces. Experiments on the newly introduced FlowBench demonstrate that ES outperforms existing methods, establishing a more reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Introduces Execute-Summarize (ES), a novel framework that decouples task execution and workflow construction to improve the accuracy and robustness of structured workflow generation from LLM reasoning.
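A minimal sketch of the Execute-Summarize split: complete the task with whatever tool calls are needed while recording a trace, then reconstruct a structured workflow from that trace alone. The `ask_llm` hook, the elided agent loop, and the workflow schema are illustrative assumptions, not the paper's implementation.

```python
# Execute-then-summarize sketch (agent loop elided; hooks are assumed).
import json

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def execute(task: str, tools: dict) -> list[dict]:
    trace = []
    # ... agent loop elided: each tool call appends
    # {"tool": name, "args": args, "result": result} to the trace ...
    return trace

def summarize(task: str, trace: list[dict]) -> dict:
    # Workflow construction sees only the finished execution trace,
    # never the intermediate free-form reasoning.
    prompt = (
        f"Task: {task}\nExecution trace:\n{json.dumps(trace, indent=2)}\n"
        "Reconstruct this as a structured workflow (JSON list of steps)."
    )
    return json.loads(ask_llm(prompt))
```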
This paper introduces an ML-driven physical synthesis framework for RF circuits that addresses limitations of prior ML approaches by incorporating EM-accurate component models and routing capabilities. They trained a neural network on a large dataset of inductor geometries to predict Q-factor with high accuracy, enabling gradient-based layout optimization. The framework integrates a P-Cell optimizer and a placement/routing engine with EM spacing rules, resulting in DRC-aware GDSII layouts.
Introduces an end-to-end ML-driven framework for RF physical synthesis that generates manufacturable GDSII layouts by integrating EM-aware neural inductor modeling with intelligent placement and routing.
The paper introduces PPTAMη, a CI/CD pipeline integrated with GitLab CI, designed to measure the energy consumption of containerized API systems during rapid deployment cycles. It addresses the gap in current CI/CD practices by incorporating power and energy measurement, revealing the impact of code changes on energy efficiency. The evaluation on a JWT-authenticated API demonstrates the pipeline's ability to collect performance and energy metrics across different commits, enabling version comparison and trend analysis.
Introduces an automated CI/CD pipeline, PPTAMη, that integrates power and energy measurement into GitLab CI for containerized API systems, enabling energy-aware development.
This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. They propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. The authors trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.
Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.
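A minimal sketch of a runtime-performance reward for a DGEMM-style task: score generated code by measured GFLOPS. This times a NumPy stand-in for illustration; the paper benchmarks compiled HPC code on a supercomputer, and the GRPO training loop is omitted.

```python
# GFLOPS-as-reward sketch (NumPy stand-in; real setup benchmarks compiled kernels).
import time
import numpy as np

def gflops_reward(matmul_fn, n: int = 2048) -> float:
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    matmul_fn(a, b)
    elapsed = time.perf_counter() - start
    # A dense n x n x n matrix multiply performs ~2*n^3 floating-point operations.
    return (2.0 * n**3) / (elapsed * 1e9)

print(f"{gflops_reward(np.matmul):.1f} GFLOPS")
```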
This paper introduces MalTool, a framework leveraging coding LLMs to automatically generate malicious tools that can compromise user security and privacy when used by LLM agents. The authors propose a taxonomy of malicious tool behaviors based on the CIA triad and use MalTool to synthesize both standalone malicious tools and real-world tools with embedded malicious behaviors. Experiments demonstrate MalTool's effectiveness in generating malicious tools, even with safety-aligned coding LLMs, and reveal the limitations of existing detection methods, underscoring the need for improved defenses.
Introduces MalTool, a novel framework for automatically generating malicious tools using coding LLMs, enabling a systematic study of malicious tool code implementations and their impact on LLM agent security.
This paper investigates the overlap between code review comments generated by human reviewers and those produced by ChatGPT-4, focusing on the types of quality improvements recommended. The authors manually classified 739 human-generated comments from 240 pull requests and compared them to ChatGPT-4's recommendations on the same PRs. Results indicate that while ChatGPT-4 suggests more changes overall, it only identifies 10% of the issues flagged by humans, though 40% of ChatGPT-4's additional suggestions are valuable, highlighting the complementary nature of both approaches.
Quantifies the overlap and differences in quality improvement recommendations between human code reviewers and ChatGPT-4, revealing the strengths and weaknesses of each approach.
This paper investigates the effectiveness of using small language models (SLMs) as judges to improve code generation, particularly in scenarios where large language models (LLMs) may underperform. The authors train and evaluate several state-of-the-art SLMs to discriminate between correct and incorrect code implementations, focusing on classification accuracy. Results demonstrate that modern SLMs, even without execution-based information, outperform previous approaches and achieve comparable performance to much larger LLMs when used as code rankers, offering a cost-effective alternative for code generation.
Demonstrates that modern small language models can effectively serve as code correctness judges and rankers, achieving performance competitive with much larger language models at a significantly reduced cost.
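A minimal sketch of the judge-as-ranker idea: score each candidate solution with a small judge model and keep the highest-scoring one. The `judge_score` callable is a hypothetical wrapper around an SLM classifier; the paper's models, prompts, and calibration are not reproduced here.

```python
# Judge-based re-ranking of code candidates (judge_score is an assumed hook).
from typing import Callable

def rank_candidates(
    problem: str,
    candidates: list[str],
    judge_score: Callable[[str, str], float],
) -> list[str]:
    # Higher score = judge believes the code is more likely correct.
    scored = [(judge_score(problem, code), code) for code in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [code for _, code in scored]

def pick_best(problem: str, candidates: list[str], judge_score) -> str:
    return rank_candidates(problem, candidates, judge_score)[0]
```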
The paper introduces Code2Worlds, a framework for generating 4D dynamic scenes by formulating the task as language-to-simulation code generation. It addresses the challenges of multi-scale context entanglement and the semantic-physical execution gap by using a dual-stream architecture for disentangled object and environment generation, combined with a physics-aware closed-loop mechanism involving a PostProcess Agent and VLM-Motion Critic. Experiments on the Code4D benchmark demonstrate that Code2Worlds significantly outperforms existing methods in scene generation score (SGS) and richness, while also generating more physically plausible dynamics.
Introduces a novel framework, Code2Worlds, that leverages coding LLMs to generate physically plausible 4D dynamic scenes through a dual-stream architecture and physics-aware closed-loop refinement.
The paper introduces DICE, a diffusion large language model (dLLM) specifically designed for CUDA kernel generation, addressing the limitations of autoregressive models and the scarcity of training data. They construct CuKe, a supervised fine-tuning dataset optimized for CUDA kernels, and propose a bi-phase curated reinforcement learning (BiC-RL) framework for training. Experiments on KernelBench show that DICE models (1.7B, 4B, and 8B parameters) outperform existing autoregressive and diffusion LLMs, achieving state-of-the-art results in CUDA kernel generation.
Introduces DICE, a novel diffusion-based LLM architecture and training methodology, that significantly improves CUDA kernel generation performance compared to existing autoregressive and diffusion models.
The paper introduces Hydra, a repository-level code generation framework that moves away from treating code as natural language and instead leverages its structured nature. Hydra employs a structure-aware indexing strategy using hierarchical trees, a dependency-aware retriever (DAR) to identify true dependencies, and a hybrid retrieval mechanism. Experiments on DevEval and RepoExec benchmarks demonstrate that Hydra achieves state-of-the-art performance, surpassing existing methods by over 5% in Pass@1 and enabling smaller models to outperform larger ones.
Introduces a novel repository-level code generation framework, Hydra, that leverages structure-aware indexing and dependency-aware retrieval to improve performance on complex code generation tasks.
This paper investigates the influence of team dynamics on OSS project selection by surveying 198 OSS practitioners. The study reveals that communication-related team dynamics like responsiveness and clarity are consistently prioritized, but the relative importance varies based on contributor motivations such as gaining reputation or networking. The findings demonstrate that aligning team dynamics with contributor motivations is crucial for understanding project selection behavior and designing better project recommendation systems.
Empirically demonstrates that team dynamics, particularly communication-related aspects, significantly influence OSS project selection, with the relative importance of specific dynamics varying based on contributor motivations.
The paper introduces CLUES, a framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores to differentiate between input ambiguity requiring clarification and model instability requiring human review. CLUES models Text-to-SQL as a two-stage process, mapping questions to interpretations and interpretations to answers, and computes instability using the Schur complement of a bipartite semantic graph matrix. Experiments on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark demonstrate that CLUES improves failure prediction compared to Kernel Language Entropy and provides diagnostic decomposition for targeted interventions.
Introduces CLUES, a novel framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores, enabling targeted interventions for query refinement and model improvement.
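For readers unfamiliar with the operation, here is a generic numerical illustration of the Schur complement that CLUES applies to its bipartite semantic graph matrix. The block values below are made up; how the interpretation and answer blocks are built from model samples follows the paper, not this sketch.

```python
# Generic Schur-complement example (illustrative values only).
import numpy as np

# Block matrix M = [[A, B], [B^T, D]] over interpretation and answer nodes.
A = np.array([[2.0, 0.5], [0.5, 2.0]])   # interpretation-interpretation block
D = np.array([[3.0, 0.2], [0.2, 3.0]])   # answer-answer block
B = np.array([[0.7, 0.1], [0.3, 0.9]])   # interpretation-answer coupling

# Schur complement of D in M: marginalizes out the answer nodes,
# leaving an effective matrix over the interpretation nodes.
schur = A - B @ np.linalg.inv(D) @ B.T
print(schur)
```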
This paper introduces SB-QOPS, a search-based quantum program testing approach that uses commuting Pauli strings as test cases and a measurement-centric oracle based on their commutation properties. SB-QOPS addresses limitations of existing quantum testing methods by reducing reliance on full program specifications and enabling effective testing on real quantum computers. Empirical evaluation on circuits up to 29 qubits across IBM, IQM, and Quantinuum platforms demonstrates that SB-QOPS achieves 100% fault detection, significantly outperforming the previous QOPS approach.
Introduces a novel search-based quantum program testing approach, SB-QOPS, that leverages commuting Pauli strings and a measurement-centric oracle to improve fault detection and reduce the need for full program specifications.
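A minimal sketch of the commutation property SB-QOPS exploits: two Pauli strings commute exactly when they anticommute on an even number of qubit positions (positions where both operators are non-identity and differ). How the search selects strings and forms the measurement-centric oracle follows the paper, not this check.

```python
# Pauli-string commutation check (standard property, not the paper's full oracle).
def pauli_strings_commute(p: str, q: str) -> bool:
    assert len(p) == len(q)
    anticommuting_sites = sum(
        1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b
    )
    return anticommuting_sites % 2 == 0

print(pauli_strings_commute("XZI", "ZXI"))  # True: two anticommuting sites
print(pauli_strings_commute("XII", "ZII"))  # False: one anticommuting site
```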
This paper investigates the use of LLMs (Claude Sonnet 4.5 and GPT-5.2) for co-evolving textual Domain-Specific Languages (DSLs) and their instances when grammars change, addressing the limitations of traditional model-driven engineering techniques in preserving human-relevant information. The study systematically evaluates the correctness and information preservation capabilities of these LLMs across ten case languages and multiple runs, varying the scale and complexity of the grammar evolutions. Results indicate high performance on small-scale instances but a significant performance degradation with increasing instance size and grammar evolution complexity, highlighting current limitations in LLM-based co-evolution for larger and more complex DSLs.
Systematically evaluates the capabilities of LLMs, specifically Claude Sonnet 4.5 and GPT-5.2, in co-evolving textual DSL grammars and instances, quantifying their performance with respect to correctness, information preservation, and scalability.
The paper introduces a RAG pipeline and a two-layer prompting strategy to extract actionable recommendations (ReACTs) for improving OSS sustainability from software engineering literature. They systematically explore open LLMs and prompting techniques to derive candidate ReACTs from ICSE and FSE papers, followed by a filtering and refinement stage to ensure quality and extract supporting evidence. The pipeline generates 1,922 ReACTs, with 1,312 meeting strict quality criteria, providing a structured and scalable approach to translate research findings into practical guidance for OSS projects.
Introduces a novel RAG pipeline leveraging LLMs to extract and structure evidence-based, actionable recommendations (ReACTs) from software engineering literature for improving OSS project sustainability.
This paper introduces a modular multi-LLM pipeline for generating agricultural simulation environments in Unreal Engine from natural language prompts, addressing limitations of existing LLM-based 3D scene generation approaches. The pipeline incorporates 3D asset retrieval, domain knowledge injection, and code generation, enhanced by LLM optimization techniques like few-shot prompting, RAG, and finetuning. Experiments demonstrate the system's effectiveness in creating realistic and semantically accurate agricultural environments, offering significant time savings compared to manual design.
Introduces a modular, multi-LLM pipeline that integrates 3D asset retrieval, domain knowledge injection, and code generation to create realistic agricultural simulation environments from natural language prompts.
This paper introduces zk-compilation, a novel approach to verifiable software provenance by executing a compiler within a zero-knowledge virtual machine (zkVM). This method generates both the compiled output and a cryptographic proof that the compilation was performed on the claimed source code with the specified compiler. The authors demonstrate the feasibility of zk-compilation using the RISC Zero zkVM and the ChibiCC C compiler, evaluating it on synthetic programs, OpenSSL, and libsodium source files, showing strong security guarantees against various attacks.
Introduces and demonstrates zk-compilation, a novel method for verifiable software provenance using zero-knowledge virtual machines.
This paper investigates the impact of few-shot prompting on the quality of LLM-generated unit tests, exploring different sources of test artifacts (human, SBST, LLM) as examples. The study evaluates the generated tests based on correctness, coverage, readability, cognitive complexity, and maintainability using GPT-4o on HumanEval and ClassEval datasets. Results demonstrate that few-shot prompting enhances test quality, with human-written examples leading to the highest coverage and correctness, and that similarity-based example retrieval further improves prompt effectiveness.
Demonstrates that few-shot prompting with human-written test examples significantly improves the quality of LLM-generated unit tests, particularly in terms of coverage and correctness, and that example retrieval based on combined problem description and code similarity optimizes prompt effectiveness.
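A minimal sketch of similarity-based example retrieval for few-shot test generation: pick the stored (description, code, test) triples most similar to the new target and place them in the prompt. TF-IDF cosine similarity is used here for simplicity; the paper's exact retrieval features may differ.

```python
# Similarity-based few-shot retrieval sketch (TF-IDF is an assumed stand-in).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_examples(target: str, pool: list[dict], k: int = 2) -> list[dict]:
    # Each pool entry: {"description": ..., "code": ..., "test": ...}
    corpus = [f'{ex["description"]}\n{ex["code"]}' for ex in pool]
    vec = TfidfVectorizer().fit(corpus + [target])
    sims = cosine_similarity(vec.transform([target]), vec.transform(corpus))[0]
    top = sims.argsort()[::-1][:k]
    return [pool[i] for i in top]

def build_prompt(target: str, examples: list[dict]) -> str:
    shots = "\n\n".join(f'{ex["code"]}\n{ex["test"]}' for ex in examples)
    return f"{shots}\n\nWrite unit tests for:\n{target}"
```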
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes compared to existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but performance can be improved by incorporating image and video-based feedback mechanisms.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
This paper examines the shift in software engineering roles due to LLMs' code generation capabilities, arguing that system architecture is becoming the primary unit of engineering value. It uses case studies from the development of two systems, *Gaari* and *The Trail*, to illustrate how the engineering bottleneck is moving from syntax to system design. The paper concludes that modern engineers must transition to a "System Architect" model focused on logic and architecture.
Argues that the core engineering value in LLM-driven development is shifting from syntax to system architecture, requiring engineers to adopt a "System Architect" mindset.
The paper introduces Dreaming in Code (DiCode), a framework that uses foundation models to generate executable environment code variations for curriculum learning in open-ended environments. DiCode addresses the challenge of discovering learnable sequences of experiences in complex environments by "dreaming" code-level variations of the world to scaffold learning. Experiments in the Craftax environment demonstrate that DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and success on late-game combat tasks where prior methods fail.
Introduces DiCode, a novel framework leveraging foundation models to synthesize executable environment code for curriculum learning, enabling agents to acquire complex skills in open-ended environments.
The paper introduces Agentic Verifier, a novel execution-based agent designed to improve the accuracy of LLMs on competitive programming tasks by actively generating discriminative test inputs to expose behavioral discrepancies among candidate solutions. This is achieved through multi-turn interaction with code execution environments, iteratively refining input generation using targeted counterexamples rather than random sampling. The agent is trained using a pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning, resulting in significant accuracy improvements (up to +10-15% in Best@K) across five competitive programming benchmarks compared to existing execution-based re-ranking methods.
Introduces an agentic verifier that actively generates discriminative test inputs to expose errors in candidate code solutions, significantly improving performance on competitive programming tasks.
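A minimal sketch of what makes a test input "discriminative": candidate solutions disagree on it. The agent's multi-turn input-generation policy and sandboxed execution are not reproduced here; `is_discriminative` simply calls candidate functions directly for illustration.

```python
# Discriminative-input check over candidate solutions (direct calls; no sandbox).
def is_discriminative(candidates, test_input) -> bool:
    outputs = []
    for solve in candidates:
        try:
            outputs.append(repr(solve(test_input)))
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    # Discriminative iff at least two candidates behave differently.
    return len(set(outputs)) > 1

# Example: two candidates disagree on a negative input.
cand_a = lambda x: abs(x)
cand_b = lambda x: x
print(is_discriminative([cand_a, cand_b], -3))  # True
```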
This paper introduces VeruSyn, a data synthesis pipeline for generating a large-scale dataset of Verus-verified Rust programs to improve code-proof generation using LLMs. VeruSyn employs self-synthesis, tutorial-based synthesis, and agent trajectory synthesis to create a dataset of 6.9 million Rust programs with formal specifications and proofs. Fine-tuning a Qwen2.5-Coder-32B-Instruct model on this dataset achieves a better cost-proof tradeoff than state-of-the-art commercial models and outperforms existing research models.
Introduces VeruSyn, a novel data synthesis pipeline that generates a large-scale dataset of Verus-verified Rust programs, significantly improving the performance of LLMs in code-proof generation.
This paper introduces LLM-Geo, a framework integrating the open-source DeepSeek-Coder model (specifically the 1.3B parameter version) into a GIS platform called DS-GeoAI, to address limitations of commercial LLM-based GIS solutions. The framework aims to reduce costs and increase accessibility by eliminating API dependencies and enabling local deployment. The DS-GeoAI platform achieves 90% accuracy in generating Python code for spatial analysis tasks after automated debugging, demonstrating comparable performance to commercial solutions with significantly lower operational costs.
Demonstrates the feasibility of using a lightweight, open-source LLM like DeepSeek-Coder for complex spatial analysis tasks within a GIS framework, achieving high accuracy and significant cost reduction compared to API-based commercial solutions.
The paper introduces Soft-Verified Efficient Repository Agents (SERA), a supervised finetuning method for efficiently training coding agents specialized to private codebases. SERA leverages Soft Verified Generation (SVG) to create thousands of synthetic trajectories from a single repository, enabling rapid and cost-effective specialization. The resulting SERA models achieve state-of-the-art performance among fully open-source models, matching the performance of models like Devstral-Small-2 at a fraction of the cost compared to reinforcement learning or previous synthetic data methods.
Introduces Soft Verified Generation (SVG), a novel method for generating synthetic code trajectories that enables efficient supervised finetuning of coding agents specialized to private codebases.
This report summarizes discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation, which explored the application of various AI techniques like LLMs, GNNs, and RL to EDA tasks. The workshop identified key challenges and opportunities across physical synthesis, high-level synthesis, optimization, and verification. The report advocates for NSF investment in AI/EDA collaboration, foundational AI research, data infrastructure, scalable compute, and workforce development to advance hardware design.
Synthesizes expert perspectives and recommendations on leveraging AI to address critical challenges in electronic design automation.
This paper introduces a two-stage GPU kernel tuner that combines LLM-based semantic refactoring into parameterizable templates with search-based autotuning of these parameters. By explicitly representing optimization choices as template parameters, the approach enables more controlled and systematic exploration of the optimization space compared to direct code rewriting. Experiments on CUDA kernels extracted from SGLang demonstrate speedups exceeding 3x, highlighting the effectiveness of the template-plus-search design.
Introduces a two-stage GPU kernel tuning approach that combines LLM-based semantic refactoring with search-based autotuning to achieve more stable and higher-quality speedups compared to agent-only direct rewriting.
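A minimal sketch of the second stage only: exhaustive search over the template parameters that the LLM-refactored kernel exposes. The `benchmark` callback (compile and time the kernel under a given configuration) is a hypothetical hook; the refactoring stage and CUDA build details are not shown.

```python
# Parameter autotuning sketch over an assumed kernel template (benchmark hook is assumed).
import itertools
from typing import Callable

def autotune(
    benchmark: Callable[[dict], float],   # returns runtime in milliseconds
    space: dict[str, list],
) -> tuple[dict, float]:
    best_cfg, best_ms = None, float("inf")
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        ms = benchmark(cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms

# Example search space for a tiled-kernel template (illustrative parameter names).
space = {"BLOCK_SIZE": [64, 128, 256], "TILE_M": [16, 32], "UNROLL": [1, 2, 4]}
```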
The paper demonstrates that LLM agents can autonomously perform tensor network simulations of quantum many-body systems, a task requiring significant human expertise. They achieve this by combining in-context learning with curated documentation and a multi-agent decomposition approach, training the agents in specialized computational domains. Benchmarking on quantum phase transitions, open quantum system dynamics, and photochemical reactions shows a ~90% success rate, with the multi-agent architecture significantly reducing implementation errors and hallucinations compared to single-agent baselines.
Shows that LLM agents can autonomously perform complex tensor network simulations of quantum many-body systems with high success rates.
The paper introduces DScheLLM, a dynamic scheduling approach using a fine-tuned Huawei OpenPangu Embedded-7B large language model within a dual-system (fast-slow) reasoning architecture to address disruptions in job shop scheduling. The model is trained on datasets generated from exact schedules obtained via an operations research solver, enabling it to handle dynamic events effectively. Experiments on standard benchmarks demonstrate the fast-thinking mode generates high-quality schedules efficiently, while the slow-thinking mode produces solver-compatible decision inputs.
Introduces a novel dual-system (fast-slow) reasoning architecture leveraging fine-tuned LLMs for dynamic job shop scheduling, demonstrating adaptability to unforeseen disturbances.
This paper presents the development of KM-LLM, a generative AI tool leveraging retrieval-augmented generation (RAG) and GPT-4o to improve knowledge management processes in Iraqi higher education institutions. The study investigates the acceptance of KM-LLM by academics using the UTAUT2 framework through a survey of 10,321 academics. Results indicate the potential of KM-LLM to enhance KM processes and identify key UTAUT2 constructs influencing the intention to adopt the application.
Develops and evaluates KM-LLM, a RAG-based application using GPT-4o, for knowledge management in Iraqi higher education, providing empirical evidence on its acceptance by academics.
The paper introduces STELP, a Secure Transpiler and Executor of LLM-Generated Programs, to address the safety and reliability issues associated with directly executing code generated by Large Language Models in production systems. STELP operates by transpiling LLM-generated code into a safer, controlled environment, mitigating vulnerabilities such as data poisoning and malicious attacks. The authors demonstrate STELP's effectiveness through benchmarks on correctness, safety, and latency, showing it outperforms existing methods in safely executing risky code snippets using a newly created human-validated dataset of insecure code.
Introduces STELP, a novel system for secure transpilation and execution of LLM-generated code, enhancing safety and reliability in production environments.
This paper explores using LLMs for neural architecture search by placing a code-oriented LLM in a closed-loop synthesis framework with iterative fine-tuning based on performance feedback and novelty filtering. The LLM synthesizes PyTorch convolutional networks, which are validated, evaluated on single-epoch accuracy, and filtered for structural redundancy using MinHash-Jaccard. Results show the LLM internalizes architectural priors, improving the valid generation rate and accuracy, and synthesizing novel, high-performing architectures not present in the original training data.
Demonstrates that LLMs can be fine-tuned using execution feedback to autonomously design novel and high-performing neural architectures, moving beyond memorization of existing designs.
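A minimal sketch of the MinHash-Jaccard novelty filter: reject a candidate whose estimated Jaccard similarity to any already-accepted architecture exceeds a threshold. The datasketch library, token-level shingling, and the 0.8 threshold are assumptions for illustration, not the paper's configuration.

```python
# MinHash-Jaccard novelty filtering over generated architecture code (assumed setup).
from datasketch import MinHash

def minhash_of(code: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in code.split():
        m.update(token.encode("utf-8"))
    return m

def filter_novel(candidates: list[str], threshold: float = 0.8) -> list[str]:
    accepted_code, accepted_sketches = [], []
    for code in candidates:
        sketch = minhash_of(code)
        # Keep only candidates sufficiently dissimilar to everything kept so far.
        if all(sketch.jaccard(prev) < threshold for prev in accepted_sketches):
            accepted_code.append(code)
            accepted_sketches.append(sketch)
    return accepted_code
```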
This study developed and evaluated an agentic AI tool leveraging LLMs and Retrieval-Augmented Generation (RAG) to automate full-text screening of publications for a systematic review of circulating biomarkers in heart failure. The tool decomposed inclusion/exclusion criteria into 136 tasks, assigned to individual LLM agents, and used a critique LLM for validation. Results showed the AI tool achieved a sensitivity of 91% and specificity of 53% in the validation phase, with greater inter-rater agreement (κ = 0.38) compared to human reviewers (κ = 0.23).
Demonstrates an agentic LLM-based AI tool's ability to automate full-text screening in systematic reviews, achieving high sensitivity and outperforming human reviewers in consistency.
The paper introduces Agent2World, a multi-agent framework for generating symbolic world models by leveraging web searching, model implementation, and adaptive unit testing. This framework grounds world model generation in multi-agent feedback, addressing the limitations of static validation methods. Fine-tuning the model with trajectories generated by the interactive testing environment leads to a substantial improvement in world-model generation, achieving a 30.95% relative gain.
Introduces a novel multi-agent framework, Agent2World, that leverages adaptive feedback from a testing team to improve the generation of symbolic world models.
The paper introduces PACIFIC, a framework for automatically generating benchmarks to evaluate LLMs' ability to follow instructions and dry-run code. PACIFIC generates benchmark variants with precise expected outputs, enabling reliable evaluation by comparing predicted and expected outputs, focusing on the LLM's intrinsic reasoning ability without relying on external tools or agentic behavior. Experiments using PACIFIC on state-of-the-art LLMs demonstrate its ability to create benchmarks of varying difficulty that effectively differentiate instruction-following and dry-running capabilities while mitigating training data contamination.
Introduces PACIFIC, a novel benchmark generation framework that isolates and evaluates LLMs' intrinsic instruction-following and code dry-running abilities, offering a scalable and contamination-resilient methodology.
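A minimal sketch of dry-run evaluation: ask the model to predict a program's printed output without executing it, then compare against the real output. The `ask_llm` hook is hypothetical, and PACIFIC's variant generation and difficulty controls are not shown.

```python
# Dry-run evaluation sketch: predicted output vs. executed output (ask_llm is assumed).
import io, contextlib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def actual_output(program: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(program, {})  # trusted benchmark code only
    return buf.getvalue().strip()

def dry_run_correct(program: str) -> bool:
    predicted = ask_llm(
        f"Without running it, what does this program print?\n{program}\n"
        "Reply with the exact output only."
    ).strip()
    return predicted == actual_output(program)
```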
The authors introduce RTLBench, a multi-dimensional benchmark suite for evaluating LLM-generated RTL code across syntax, functionality, lint compliance, readability, and style consistency, using 160 cases from textbooks and open-source projects. They evaluate 24 state-of-the-art LLMs, revealing that while syntax and functionality are reasonably addressed, engineering quality aspects are often lacking. To improve LLM-generated RTL, they propose Log2BetterRTL, a log-driven feedback system that leverages EDA tool diagnostics to iteratively refine the code, demonstrating significant improvements in various quality metrics.
Introduces RTLBench, a comprehensive benchmark suite with a multi-dimensional evaluation framework, to assess and improve the quality of LLM-generated RTL code beyond syntax and functionality.
The paper introduces MR-Size, an explainable effort estimator that predicts T-shirt sizes and interpolated day estimates for GitLab merge requests by computing a composite complexity score based on code diffs, file weights, contributor dynamics, and semantic signals. MR-Size achieves a Pearson correlation of 0.79 and a mean absolute error of 2.34 days across 150 merge requests, matching LOC baselines while providing per-file explanations. The method, datasets, evaluation plan, and reproducibility artifacts are described, with a benchmarking protocol comparing MR-Size against LOC baselines, COCOMO-style models, and learned regressors.
Introduces an explainable, repository-driven method (MR-Size) for estimating agile effort by mapping merge requests to T-shirt sizes using a composite complexity score derived from code diffs, file weights, contributor dynamics, and semantic contextual signals.
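A minimal sketch of mapping a merge request's composite complexity score to a T-shirt size and a day estimate. The feature weights, size thresholds, and day anchors below are illustrative assumptions, not MR-Size's calibrated values, and the interpolation between sizes is omitted.

```python
# Composite complexity score -> T-shirt size sketch (weights and thresholds assumed).
def complexity_score(lines_changed: int, files_touched: int,
                     weighted_file_risk: float, new_contributor: bool) -> float:
    score = 0.4 * lines_changed / 100 + 0.3 * files_touched
    score += 0.2 * weighted_file_risk + (1.0 if new_contributor else 0.0)
    return score

# (label, score upper bound, day anchor) buckets.
SIZES = [("XS", 1.0, 0.5), ("S", 3.0, 1.0), ("M", 6.0, 3.0),
         ("L", 10.0, 6.0), ("XL", float("inf"), 10.0)]

def tshirt_estimate(score: float) -> tuple[str, float]:
    for label, upper, days in SIZES:
        if score < upper:
            return label, days
    return "XL", 10.0

print(tshirt_estimate(complexity_score(250, 4, 1.5, False)))  # ('S', 1.0)
```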
This paper investigates the applicability of open-source LLM frameworks, including both large-scale and lightweight models, for automating penetration testing tasks relevant to commercial security assessments. The study identifies both the potential and limitations of these frameworks in addressing fundamental challenges in penetration testing. The authors propose a practical approach to overcome key limitations and demonstrate the potential of LLM-based frameworks in real-world penetration testing scenarios.
Demonstrates the practical application of open-source LLM frameworks for penetration testing, highlighting their capabilities and limitations, and proposes solutions to address identified challenges.
This paper benchmarks the performance of DeepSeek Coder and Meta-llama-3-70b-instruct in detecting SQL injection vulnerabilities using a labeled dataset of malicious and legitimate SQL queries. The evaluation focuses on Boolean-based attacks and measures precision, recall, F1-score, and accuracy. Meta-llama-3-70b-instruct achieved superior recall and overall accuracy (74.00%) compared to DeepSeek Coder (60.00%), suggesting it is better at detecting a wider range of malicious queries, though both models require further refinement for standalone security analysis.
Quantifies and compares the effectiveness of DeepSeek Coder and Meta-llama-3-70b-instruct in identifying SQL injection vulnerabilities, revealing the strengths and weaknesses of each model.
The paper introduces GrowthHacker, a benchmark and framework for optimizing off-policy evaluation (OPE) using code-modifying LLM agents, addressing the limitations of online A/B testing. They developed a two-agent framework within GrowthHacker that iteratively optimizes OPE code, evaluates the results, and initiates new optimization cycles using real-world datasets from Open Bandit Pipeline and Scope-RL. Experiments demonstrate that the two-agent framework achieves 100% reliability and a 106.7% average improvement in OPE performance, outperforming other LLM agent-based approaches.
Demonstrates the feasibility and effectiveness of using code-modifying LLM agents to automatically optimize off-policy evaluation, achieving significant performance improvements over baseline methods.
This paper introduces a post-tool execution reflection mechanism that leverages LLM-based reflection and domain-specific RAG to repair failed tool calls in agentic systems. The approach uses a combination of tool-specific documentation and troubleshooting documents to identify and correct both syntactic and semantic errors that are only apparent after the tool's response is analyzed. Experiments using the kubectl command-line tool for Kubernetes management demonstrate that the RAG-based reflection improves the execution pass rate by 55% and the correctness of answers to user queries by 36% on average, with troubleshooting documents outperforming official documentation.
Introduces a novel post-tool execution reflection component that combines LLM-based reflection with domain-specific RAG to improve the reliability and accuracy of tool calls in agentic systems.
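A minimal sketch of post-execution reflection for a kubectl call: run the command, and on failure retrieve troubleshooting passages and ask the LLM to propose a corrected command. The `retrieve_docs` and `ask_llm` functions are hypothetical hooks standing in for the paper's RAG index and reflection prompt.

```python
# Post-tool-execution reflection loop (retrieval and LLM hooks are assumed).
import subprocess, shlex

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def retrieve_docs(error_text: str, k: int = 3) -> list[str]:
    raise NotImplementedError("plug in your troubleshooting-doc retriever here")

def run_with_reflection(command: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Reflection step: ground the repair in retrieved troubleshooting docs.
        context = "\n---\n".join(retrieve_docs(result.stderr))
        command = ask_llm(
            f"The command `{command}` failed with:\n{result.stderr}\n"
            f"Relevant docs:\n{context}\nReply with a corrected command only."
        )
    raise RuntimeError("tool call could not be repaired")
```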
The paper introduces Blueprint2Code, a multi-agent framework designed to improve code generation by mimicking the human programming workflow through task comprehension, planning, implementation, and iterative refinement. This framework utilizes four interacting agents—Previewing, Blueprint, Coding, and Debugging—to address the limitations of LLMs in complex programming tasks requiring multi-step reasoning and reliable code generation. Experiments on HumanEval, MBPP, and APPS datasets demonstrate that Blueprint2Code achieves state-of-the-art pass@1 results, significantly outperforming existing methods, especially on extended and more challenging versions of the benchmarks.
Introduces a novel multi-agent framework, Blueprint2Code, that decomposes code generation into distinct stages handled by specialized agents to improve performance on complex programming tasks.

