Code Generation & Program Synthesis
Capabilities
AI-driven code generation, program synthesis, automated debugging, and software engineering with LLMs.
Recent Papers
This paper investigates the impact of data imbalance on deep learning-based software vulnerability detection using nine open-source datasets and two state-of-the-art DL models. The study confirms that data imbalance significantly affects model performance and that existing imbalance solutions exhibit varying effectiveness across datasets and evaluation metrics. The authors find that focal loss improves precision, that mean false error and class-balanced loss improve recall, and that random over-sampling improves F1-measure, but no single solution excels across all metrics.
Empirically demonstrates the significant impact of data imbalance on deep learning models for software vulnerability detection and evaluates the effectiveness of existing imbalance solutions across multiple datasets and metrics.
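As a point of reference, here is a minimal sketch of focal loss, one of the imbalance remedies the study evaluates, in PyTorch for a binary vulnerability classifier. The alpha/gamma values and the toy data are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal focal-loss sketch for imbalanced binary classification (assumed setup).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-example binary cross-entropy (no reduction yet).
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the model's probability for the true class; bce = -log(p_t).
    p_t = torch.exp(-bce)
    # Down-weight easy examples by (1 - p_t)^gamma, re-weight classes by alpha.
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()

# Example: 8 mostly-negative samples, mimicking an imbalanced vulnerability dataset.
logits = torch.randn(8)
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 0], dtype=torch.float32)
print(focal_loss(logits, targets))
```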
This paper presents an empirical study of AI coding agent contributions in open-source Android and iOS mobile app development by analyzing 2,901 AI-authored pull requests (PRs) from 193 GitHub repositories. The study reveals that Android projects receive more AI-authored PRs and exhibit higher acceptance rates compared to iOS, with routine tasks showing higher acceptance rates than structural changes. The analysis also indicates an initial improvement followed by a decline in PR resolution time on Android, providing insights into the evolving impact of AI agents on OSS mobile projects.
Empirically characterizes the effects of AI coding agents on open-source Android and iOS mobile app projects by analyzing PR acceptance behaviors across platforms, agents, and task categories.
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces ModelWisdom, a toolkit designed to enhance the interpretability and usability of TLA+ model checking by addressing challenges in counterexample analysis and model repair. ModelWisdom integrates visualization techniques, graph optimization, LLM-based summarization, and automated repair suggestions to improve the debugging process. The toolkit's capabilities, including colorized violation highlighting, graph folding, and LLM-powered explanations, facilitate a more interactive and understandable workflow for TLA+ specifications.
Introduces an interactive environment, ModelWisdom, that leverages visualization and large language models to improve the interpretability and actionability of TLA+ model checking.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
The paper introduces PhyNiKCE, a neurosymbolic agentic framework that addresses the limitations of LLMs in autonomous CFD by decoupling neural planning from symbolic validation. PhyNiKCE uses a Symbolic Knowledge Engine to enforce physical constraints via a Deterministic RAG Engine, treating simulation setup as a Constraint Satisfaction Problem. Experiments using OpenFOAM and Gemini-2.5-Pro/Flash demonstrate a 96% improvement over baselines, a 59% reduction in self-correction loops, and a 17% decrease in LLM token consumption.
Introduces PhyNiKCE, a neurosymbolic framework that integrates neural planning with symbolic constraint enforcement to improve the reliability and efficiency of autonomous CFD agents.
This paper investigates GPT-5's ability to learn Idris, a functional programming language, through iterative prompting strategies. The authors found that zero-shot performance on Idris programming exercises was significantly lower than performance on Python and Erlang. By incorporating local compilation errors into the prompts, the authors achieved a substantial performance increase, solving 54 out of 56 problems.
Demonstrates that compiler-guided, error-driven iterative prompting significantly improves GPT-5's performance in a low-resource programming language.
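A minimal sketch of this error-driven loop: generate code, type-check it locally, and feed the compiler diagnostics back into the next prompt. The `ask_llm` hook and the `idris2 --check` invocation are assumptions standing in for the paper's actual harness.

```python
# Compiler-feedback prompting loop (sketch; ask_llm and the idris2 flags are assumed).
import subprocess, tempfile, pathlib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def solve_with_compiler_feedback(task: str, max_rounds: int = 5):
    prompt = f"Write an Idris 2 solution for:\n{task}"
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        src = pathlib.Path(tempfile.mkdtemp()) / "Main.idr"
        src.write_text(code)
        result = subprocess.run(
            ["idris2", "--check", str(src)], capture_output=True, text=True
        )
        if result.returncode == 0:
            return code  # type-checks; hand off to functional tests
        # Append the compiler diagnostics so the next attempt can repair them.
        errors = result.stdout + result.stderr
        prompt += f"\n\nYour code failed to compile:\n{errors}\nFix it."
    return None
```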
The paper introduces MING, an MLIR-based framework for automating the HLS design process of CNNs targeting resource-constrained edge FPGAs. MING employs a streaming architecture with optimized buffer management to address the limitations of existing frameworks in handling stringent resource constraints. Experiments demonstrate that MING achieves significant speedups (15x for multi-layer CNN kernels and up to 200x for single-layer kernels) and can generate efficient designs for larger input sizes where other frameworks fail.
Introduces an MLIR-based framework, MING, that automates HLS design for CNNs on resource-constrained edge FPGAs using a streaming architecture with optimized buffer management.
The paper introduces Execute-Summarize (ES), a framework that decouples task execution from workflow construction in LLMs, addressing the challenge of accurately translating LLM reasoning into structured workflows. ES first completes the task using available tools and then independently reconstructs a structured workflow from execution traces. Experiments on the newly introduced FlowBench demonstrate that ES outperforms existing methods, establishing a more reliable paradigm for grounding free-form LLM reasoning into structured workflows.
Introduces Execute-Summarize (ES), a novel framework that decouples task execution and workflow construction to improve the accuracy and robustness of structured workflow generation from LLM reasoning.
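A minimal sketch of the Execute-Summarize split: complete the task with whatever tool calls are needed while recording a trace, then reconstruct a structured workflow from that trace alone. The `ask_llm` hook, the elided agent loop, and the workflow schema are illustrative assumptions, not the paper's implementation.

```python
# Execute-then-summarize sketch (agent loop elided; hooks are assumed).
import json

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def execute(task: str, tools: dict) -> list[dict]:
    trace = []
    # ... agent loop elided: each tool call appends
    # {"tool": name, "args": args, "result": result} to the trace ...
    return trace

def summarize(task: str, trace: list[dict]) -> dict:
    # Workflow construction sees only the finished execution trace,
    # never the intermediate free-form reasoning.
    prompt = (
        f"Task: {task}\nExecution trace:\n{json.dumps(trace, indent=2)}\n"
        "Reconstruct this as a structured workflow (JSON list of steps)."
    )
    return json.loads(ask_llm(prompt))
```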
This paper introduces an ML-driven physical synthesis framework for RF circuits that addresses limitations of prior ML approaches by incorporating EM-accurate component models and routing capabilities. They trained a neural network on a large dataset of inductor geometries to predict Q-factor with high accuracy, enabling gradient-based layout optimization. The framework integrates a P-Cell optimizer and a placement/routing engine with EM spacing rules, resulting in DRC-aware GDSII layouts.
Introduces an end-to-end ML-driven framework for RF physical synthesis that generates manufacturable GDSII layouts by integrating EM-aware neural inductor modeling with intelligent placement and routing.
The paper introduces PPTAMη, a CI/CD pipeline integrated with GitLab CI, designed to measure the energy consumption of containerized API systems during rapid deployment cycles. It addresses the gap in current CI/CD practices by incorporating power and energy measurement, revealing the impact of code changes on energy efficiency. The evaluation on a JWT-authenticated API demonstrates the pipeline's ability to collect performance and energy metrics across different commits, enabling version comparison and trend analysis.
Introduces an automated CI/CD pipeline, PPTAMη, that integrates power and energy measurement into GitLab CI for containerized API systems, enabling energy-aware development.
This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. They propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. The authors trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.
Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.
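A minimal sketch of a runtime-performance reward for a DGEMM-style task: score generated code by measured GFLOPS. This times a NumPy stand-in for illustration; the paper benchmarks compiled HPC code on a supercomputer, and the GRPO training loop is omitted.

```python
# GFLOPS-as-reward sketch (NumPy stand-in; real setup benchmarks compiled kernels).
import time
import numpy as np

def gflops_reward(matmul_fn, n: int = 2048) -> float:
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    matmul_fn(a, b)
    elapsed = time.perf_counter() - start
    # A dense n x n x n matrix multiply performs ~2*n^3 floating-point operations.
    return (2.0 * n**3) / (elapsed * 1e9)

print(f"{gflops_reward(np.matmul):.1f} GFLOPS")
```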
This paper introduces MalTool, a framework leveraging coding LLMs to automatically generate malicious tools that can compromise user security and privacy when used by LLM agents. The authors propose a taxonomy of malicious tool behaviors based on the CIA triad and use MalTool to synthesize both standalone malicious tools and real-world tools with embedded malicious behaviors. Experiments demonstrate MalTool's effectiveness in generating malicious tools, even with safety-aligned coding LLMs, and reveal the limitations of existing detection methods, underscoring the need for improved defenses.
Introduces MalTool, a novel framework for automatically generating malicious tools using coding LLMs, enabling a systematic study of malicious tool code implementations and their impact on LLM agent security.
This paper investigates the overlap between code review comments generated by human reviewers and those produced by ChatGPT-4, focusing on the types of quality improvements recommended. The authors manually classified 739 human-generated comments from 240 pull requests and compared them to ChatGPT-4's recommendations on the same PRs. Results indicate that while ChatGPT-4 suggests more changes overall, it only identifies 10% of the issues flagged by humans, though 40% of ChatGPT-4's additional suggestions are valuable, highlighting the complementary nature of both approaches.
Quantifies the overlap and differences in quality improvement recommendations between human code reviewers and ChatGPT-4, revealing the strengths and weaknesses of each approach.
This paper investigates the effectiveness of using small language models (SLMs) as judges to improve code generation, particularly in scenarios where large language models (LLMs) may underperform. The authors train and evaluate several state-of-the-art SLMs to discriminate between correct and incorrect code implementations, focusing on classification accuracy. Results demonstrate that modern SLMs, even without execution-based information, outperform previous approaches and achieve comparable performance to much larger LLMs when used as code rankers, offering a cost-effective alternative for code generation.
Demonstrates that modern small language models can effectively serve as code correctness judges and rankers, achieving performance competitive with much larger language models at a significantly reduced cost.
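A minimal sketch of the judge-as-ranker idea: score each candidate solution with a small judge model and keep the highest-scoring one. The `judge_score` callable is a hypothetical wrapper around an SLM classifier; the paper's models, prompts, and calibration are not reproduced here.

```python
# Judge-based re-ranking of code candidates (judge_score is an assumed hook).
from typing import Callable

def rank_candidates(
    problem: str,
    candidates: list[str],
    judge_score: Callable[[str, str], float],
) -> list[str]:
    # Higher score = judge believes the code is more likely correct.
    scored = [(judge_score(problem, code), code) for code in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [code for _, code in scored]

def pick_best(problem: str, candidates: list[str], judge_score) -> str:
    return rank_candidates(problem, candidates, judge_score)[0]
```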
The paper introduces Code2Worlds, a framework for generating 4D dynamic scenes by formulating the task as language-to-simulation code generation. It addresses the challenges of multi-scale context entanglement and the semantic-physical execution gap by using a dual-stream architecture for disentangled object and environment generation, combined with a physics-aware closed-loop mechanism involving a PostProcess Agent and VLM-Motion Critic. Experiments on the Code4D benchmark demonstrate that Code2Worlds significantly outperforms existing methods in scene generation score (SGS) and richness, while also generating more physically plausible dynamics.
Introduces a novel framework, Code2Worlds, that leverages coding LLMs to generate physically plausible 4D dynamic scenes through a dual-stream architecture and physics-aware closed-loop refinement.
The paper introduces DICE, a diffusion large language model (dLLM) specifically designed for CUDA kernel generation, addressing the limitations of autoregressive models and the scarcity of training data. They construct CuKe, a supervised fine-tuning dataset optimized for CUDA kernels, and propose a bi-phase curated reinforcement learning (BiC-RL) framework for training. Experiments on KernelBench show that DICE models (1.7B, 4B, and 8B parameters) outperform existing autoregressive and diffusion LLMs, achieving state-of-the-art results in CUDA kernel generation.
Introduces DICE, a novel diffusion-based LLM architecture and training methodology, that significantly improves CUDA kernel generation performance compared to existing autoregressive and diffusion models.
The paper introduces Hydra, a repository-level code generation framework that moves away from treating code as natural language and instead leverages its structured nature. Hydra employs a structure-aware indexing strategy using hierarchical trees, a dependency-aware retriever (DAR) to identify true dependencies, and a hybrid retrieval mechanism. Experiments on DevEval and RepoExec benchmarks demonstrate that Hydra achieves state-of-the-art performance, surpassing existing methods by over 5% in Pass@1 and enabling smaller models to outperform larger ones.
Introduces a novel repository-level code generation framework, Hydra, that leverages structure-aware indexing and dependency-aware retrieval to improve performance on complex code generation tasks.
This paper investigates the influence of team dynamics on OSS project selection by surveying 198 OSS practitioners. The study reveals that communication-related team dynamics like responsiveness and clarity are consistently prioritized, but the relative importance varies based on contributor motivations such as gaining reputation or networking. The findings demonstrate that aligning team dynamics with contributor motivations is crucial for understanding project selection behavior and designing better project recommendation systems.
Empirically demonstrates that team dynamics, particularly communication-related aspects, significantly influence OSS project selection, with the relative importance of specific dynamics varying based on contributor motivations.
The paper introduces CLUES, a framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores to differentiate between input ambiguity requiring clarification and model instability requiring human review. CLUES models Text-to-SQL as a two-stage process, mapping questions to interpretations and interpretations to answers, and computes instability using the Schur complement of a bipartite semantic graph matrix. Experiments on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark demonstrate that CLUES improves failure prediction compared to Kernel Language Entropy and provides diagnostic decomposition for targeted interventions.
Introduces CLUES, a novel framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores, enabling targeted interventions for query refinement and model improvement.
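For readers unfamiliar with the operation, here is a generic numerical illustration of the Schur complement that CLUES applies to its bipartite semantic graph matrix. The block values below are made up; how the interpretation and answer blocks are built from model samples follows the paper, not this sketch.

```python
# Generic Schur-complement example (illustrative values only).
import numpy as np

# Block matrix M = [[A, B], [B^T, D]] over interpretation and answer nodes.
A = np.array([[2.0, 0.5], [0.5, 2.0]])   # interpretation-interpretation block
D = np.array([[3.0, 0.2], [0.2, 3.0]])   # answer-answer block
B = np.array([[0.7, 0.1], [0.3, 0.9]])   # interpretation-answer coupling

# Schur complement of D in M: marginalizes out the answer nodes,
# leaving an effective matrix over the interpretation nodes.
schur = A - B @ np.linalg.inv(D) @ B.T
print(schur)
```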
This paper introduces SB-QOPS, a search-based quantum program testing approach that uses commuting Pauli strings as test cases and a measurement-centric oracle based on their commutation properties. SB-QOPS addresses limitations of existing quantum testing methods by reducing reliance on full program specifications and enabling effective testing on real quantum computers. Empirical evaluation on circuits up to 29 qubits across IBM, IQM, and Quantinuum platforms demonstrates that SB-QOPS achieves 100% fault detection, significantly outperforming the previous QOPS approach.
Introduces a novel search-based quantum program testing approach, SB-QOPS, that leverages commuting Pauli strings and a measurement-centric oracle to improve fault detection and reduce the need for full program specifications.
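A minimal sketch of the commutation property SB-QOPS exploits: two Pauli strings commute exactly when they anticommute on an even number of qubit positions (positions where both operators are non-identity and differ). How the search selects strings and forms the measurement-centric oracle follows the paper, not this check.

```python
# Pauli-string commutation check (standard property, not the paper's full oracle).
def pauli_strings_commute(p: str, q: str) -> bool:
    assert len(p) == len(q)
    anticommuting_sites = sum(
        1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b
    )
    return anticommuting_sites % 2 == 0

print(pauli_strings_commute("XZI", "ZXI"))  # True: two anticommuting sites
print(pauli_strings_commute("XII", "ZII"))  # False: one anticommuting site
```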
This paper investigates the use of LLMs (Claude Sonnet 4.5 and GPT-5.2) for co-evolving textual Domain-Specific Languages (DSLs) and their instances when grammars change, addressing the limitations of traditional model-driven engineering techniques in preserving human-relevant information. The study systematically evaluates the correctness and information preservation capabilities of these LLMs across ten case languages and multiple runs, varying the scale and complexity of the grammar evolutions. Results indicate high performance on small-scale instances but a significant performance degradation with increasing instance size and grammar evolution complexity, highlighting current limitations in LLM-based co-evolution for larger and more complex DSLs.
Systematically evaluates the capabilities of LLMs, specifically Claude Sonnet 4.5 and GPT-5.2, in co-evolving textual DSL grammars and instances, quantifying their performance with respect to correctness, information preservation, and scalability.
The paper introduces a RAG pipeline and a two-layer prompting strategy to extract actionable recommendations (ReACTs) for improving OSS sustainability from software engineering literature. They systematically explore open LLMs and prompting techniques to derive candidate ReACTs from ICSE and FSE papers, followed by a filtering and refinement stage to ensure quality and extract supporting evidence. The pipeline generates 1,922 ReACTs, with 1,312 meeting strict quality criteria, providing a structured and scalable approach to translate research findings into practical guidance for OSS projects.
Introduces a novel RAG pipeline leveraging LLMs to extract and structure evidence-based, actionable recommendations (ReACTs) from software engineering literature for improving OSS project sustainability.
This paper introduces a modular multi-LLM pipeline for generating agricultural simulation environments in Unreal Engine from natural language prompts, addressing limitations of existing LLM-based 3D scene generation approaches. The pipeline incorporates 3D asset retrieval, domain knowledge injection, and code generation, enhanced by LLM optimization techniques like few-shot prompting, RAG, and finetuning. Experiments demonstrate the system's effectiveness in creating realistic and semantically accurate agricultural environments, offering significant time savings compared to manual design.
Introduces a modular, multi-LLM pipeline that integrates 3D asset retrieval, domain knowledge injection, and code generation to create realistic agricultural simulation environments from natural language prompts.
This paper introduces zk-compilation, a novel approach to verifiable software provenance by executing a compiler within a zero-knowledge virtual machine (zkVM). This method generates both the compiled output and a cryptographic proof that the compilation was performed on the claimed source code with the specified compiler. The authors demonstrate the feasibility of zk-compilation using the RISC Zero zkVM and the ChibiCC C compiler, evaluating it on synthetic programs, OpenSSL, and libsodium source files, showing strong security guarantees against various attacks.
Introduces and demonstrates zk-compilation, a novel method for verifiable software provenance using zero-knowledge virtual machines.
This paper investigates the impact of few-shot prompting on the quality of LLM-generated unit tests, exploring different sources of test artifacts (human, SBST, LLM) as examples. The study evaluates the generated tests based on correctness, coverage, readability, cognitive complexity, and maintainability using GPT-4o on HumanEval and ClassEval datasets. Results demonstrate that few-shot prompting enhances test quality, with human-written examples leading to the highest coverage and correctness, and that similarity-based example retrieval further improves prompt effectiveness.
Demonstrates that few-shot prompting with human-written test examples significantly improves the quality of LLM-generated unit tests, particularly in terms of coverage and correctness, and that example retrieval based on combined problem description and code similarity optimizes prompt effectiveness.
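A minimal sketch of similarity-based example retrieval for few-shot test generation: pick the stored (description, code, test) triples most similar to the new target and place them in the prompt. TF-IDF cosine similarity is used here for simplicity; the paper's exact retrieval features may differ.

```python
# Similarity-based few-shot retrieval sketch (TF-IDF is an assumed stand-in).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_examples(target: str, pool: list[dict], k: int = 2) -> list[dict]:
    # Each pool entry: {"description": ..., "code": ..., "test": ...}
    corpus = [f'{ex["description"]}\n{ex["code"]}' for ex in pool]
    vec = TfidfVectorizer().fit(corpus + [target])
    sims = cosine_similarity(vec.transform([target]), vec.transform(corpus))[0]
    top = sims.argsort()[::-1][:k]
    return [pool[i] for i in top]

def build_prompt(target: str, examples: list[dict]) -> str:
    shots = "\n\n".join(f'{ex["code"]}\n{ex["test"]}' for ex in examples)
    return f"{shots}\n\nWrite unit tests for:\n{target}"
```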
The paper introduces GameDevBench, a new benchmark for evaluating multimodal agents in game development, a domain requiring complex code manipulation and multimodal asset handling. The benchmark comprises 132 tasks derived from tutorials, demanding significantly more code and file changes compared to existing software development benchmarks. Experiments reveal that current agents struggle with game development tasks, particularly those involving 2D graphics, but performance can be improved by incorporating image and video-based feedback mechanisms.
Introduces GameDevBench, a novel benchmark designed to evaluate and advance multimodal agents in the challenging domain of game development.
This paper examines the shift in software engineering roles due to LLMs' code generation capabilities, arguing that system architecture is becoming the primary unit of engineering value. It uses case studies from the development of two systems, *Gaari* and *The Trail*, to illustrate how the engineering bottleneck is moving from syntax to system design. The paper concludes that modern engineers must transition to a "System Architect" model focused on logic and architecture.
Argues that the core engineering value in LLM-driven development is shifting from syntax to system architecture, requiring engineers to adopt a "System Architect" mindset.
The paper introduces Dreaming in Code (DiCode), a framework that uses foundation models to generate executable environment code variations for curriculum learning in open-ended environments. DiCode addresses the challenge of discovering learnable sequences of experiences in complex environments by "dreaming" code-level variations of the world to scaffold learning. Experiments in the Craftax environment demonstrate that DiCode enables agents to acquire long-horizon skills, achieving a 16% improvement in mean return over the strongest baseline and success on late-game combat tasks where prior methods fail.
Introduces DiCode, a novel framework leveraging foundation models to synthesize executable environment code for curriculum learning, enabling agents to acquire complex skills in open-ended environments.
The paper introduces Agentic Verifier, a novel execution-based agent designed to improve the accuracy of LLMs on competitive programming tasks by actively generating discriminative test inputs to expose behavioral discrepancies among candidate solutions. This is achieved through multi-turn interaction with code execution environments, iteratively refining input generation using targeted counterexamples rather than random sampling. The agent is trained using a pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning, resulting in significant accuracy improvements (up to +10-15% in Best@K) across five competitive programming benchmarks compared to existing execution-based re-ranking methods.
Introduces an agentic verifier that actively generates discriminative test inputs to expose errors in candidate code solutions, significantly improving performance on competitive programming tasks.
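A minimal sketch of what makes a test input "discriminative": candidate solutions disagree on it. The agent's multi-turn input-generation policy and sandboxed execution are not reproduced here; `is_discriminative` simply calls candidate functions directly for illustration.

```python
# Discriminative-input check over candidate solutions (direct calls; no sandbox).
def is_discriminative(candidates, test_input) -> bool:
    outputs = []
    for solve in candidates:
        try:
            outputs.append(repr(solve(test_input)))
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    # Discriminative iff at least two candidates behave differently.
    return len(set(outputs)) > 1

# Example: two candidates disagree on a negative input.
cand_a = lambda x: abs(x)
cand_b = lambda x: x
print(is_discriminative([cand_a, cand_b], -3))  # True
```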
This paper introduces VeruSyn, a data synthesis pipeline for generating a large-scale dataset of Verus-verified Rust programs to improve code-proof generation using LLMs. VeruSyn employs self-synthesis, tutorial-based synthesis, and agent trajectory synthesis to create a dataset of 6.9 million Rust programs with formal specifications and proofs. Fine-tuning a Qwen2.5-Coder-32B-Instruct model on this dataset achieves a better cost-proof tradeoff than state-of-the-art commercial models and outperforms existing research models.
Introduces VeruSyn, a novel data synthesis pipeline that generates a large-scale dataset of Verus-verified Rust programs, significantly improving the performance of LLMs in code-proof generation.
This paper introduces LLM-Geo, a framework integrating the open-source DeepSeek-Coder model (specifically the 1.3B parameter version) into a GIS platform called DS-GeoAI, to address limitations of commercial LLM-based GIS solutions. The framework aims to reduce costs and increase accessibility by eliminating API dependencies and enabling local deployment. The DS-GeoAI platform achieves 90% accuracy in generating Python code for spatial analysis tasks after automated debugging, demonstrating comparable performance to commercial solutions with significantly lower operational costs.
Demonstrates the feasibility of using a lightweight, open-source LLM like DeepSeek-Coder for complex spatial analysis tasks within a GIS framework, achieving high accuracy and significant cost reduction compared to API-based commercial solutions.
The paper introduces Soft-Verified Efficient Repository Agents (SERA), a supervised finetuning method for efficiently training coding agents specialized to private codebases. SERA leverages Soft Verified Generation (SVG) to create thousands of synthetic trajectories from a single repository, enabling rapid and cost-effective specialization. The resulting SERA models achieve state-of-the-art performance among fully open-source models, matching the performance of models like Devstral-Small-2 at a fraction of the cost compared to reinforcement learning or previous synthetic data methods.
Introduces Soft Verified Generation (SVG), a novel method for generating synthetic code trajectories that enables efficient supervised finetuning of coding agents specialized to private codebases.
This report summarizes discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation, which explored the application of various AI techniques like LLMs, GNNs, and RL to EDA tasks. The workshop identified key challenges and opportunities across physical synthesis, high-level synthesis, optimization, and verification. The report advocates for NSF investment in AI/EDA collaboration, foundational AI research, data infrastructure, scalable compute, and workforce development to advance hardware design.
Synthesizes expert perspectives and recommendations on leveraging AI to address critical challenges in electronic design automation.
This paper introduces a two-stage GPU kernel tuner that combines LLM-based semantic refactoring into parameterizable templates with search-based autotuning of these parameters. By explicitly representing optimization choices as template parameters, the approach enables more controlled and systematic exploration of the optimization space compared to direct code rewriting. Experiments on CUDA kernels extracted from SGLang demonstrate speedups exceeding 3x, highlighting the effectiveness of the template-plus-search design.
Introduces a two-stage GPU kernel tuning approach that combines LLM-based semantic refactoring with search-based autotuning to achieve more stable and higher-quality speedups compared to agent-only direct rewriting.
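A minimal sketch of the second stage only: exhaustive search over the template parameters that the LLM-refactored kernel exposes. The `benchmark` callback (compile and time the kernel under a given configuration) is a hypothetical hook; the refactoring stage and CUDA build details are not shown.

```python
# Parameter autotuning sketch over an assumed kernel template (benchmark hook is assumed).
import itertools
from typing import Callable

def autotune(
    benchmark: Callable[[dict], float],   # returns runtime in milliseconds
    space: dict[str, list],
) -> tuple[dict, float]:
    best_cfg, best_ms = None, float("inf")
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        ms = benchmark(cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms

# Example search space for a tiled-kernel template (illustrative parameter names).
space = {"BLOCK_SIZE": [64, 128, 256], "TILE_M": [16, 32], "UNROLL": [1, 2, 4]}
```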
The paper demonstrates that LLM agents can autonomously perform tensor network simulations of quantum many-body systems, a task requiring significant human expertise. They achieve this by combining in-context learning with curated documentation and a multi-agent decomposition approach, training the agents in specialized computational domains. Benchmarking on quantum phase transitions, open quantum system dynamics, and photochemical reactions shows a ~90% success rate, with the multi-agent architecture significantly reducing implementation errors and hallucinations compared to single-agent baselines.
Shows that LLM agents can autonomously perform complex tensor network simulations of quantum many-body systems with high success rates.
The paper introduces DScheLLM, a dynamic scheduling approach using a fine-tuned Huawei OpenPangu Embedded-7B large language model within a dual-system (fast-slow) reasoning architecture to address disruptions in job shop scheduling. The model is trained on datasets generated from exact schedules obtained via an operations research solver, enabling it to handle dynamic events effectively. Experiments on standard benchmarks demonstrate the fast-thinking mode generates high-quality schedules efficiently, while the slow-thinking mode produces solver-compatible decision inputs.
Introduces a novel dual-system (fast-slow) reasoning architecture leveraging fine-tuned LLMs for dynamic job shop scheduling, demonstrating adaptability to unforeseen disturbances.
This paper presents the development of KM-LLM, a generative AI tool leveraging retrieval-augmented generation (RAG) and GPT-4o to improve knowledge management processes in Iraqi higher education institutions. The study investigates the acceptance of KM-LLM by academics using the UTAUT2 framework through a survey of 10,321 academics. Results indicate the potential of KM-LLM to enhance KM processes and identify key UTAUT2 constructs influencing the intention to adopt the application.
Develops and evaluates KM-LLM, a RAG-based application using GPT-4o, for knowledge management in Iraqi higher education, providing empirical evidence on its acceptance by academics.
The paper introduces STELP, a Secure Transpiler and Executor of LLM-Generated Programs, to address the safety and reliability issues associated with directly executing code generated by Large Language Models in production systems. STELP operates by transpiling LLM-generated code into a safer, controlled environment, mitigating vulnerabilities such as data poisoning and malicious attacks. The authors demonstrate STELP's effectiveness through benchmarks on correctness, safety, and latency, showing it outperforms existing methods in safely executing risky code snippets using a newly created human-validated dataset of insecure code.
Introduces STELP, a novel system for secure transpilation and execution of LLM-generated code, enhancing safety and reliability in production environments.
This paper explores using LLMs for neural architecture search by placing a code-oriented LLM in a closed-loop synthesis framework with iterative fine-tuning based on performance feedback and novelty filtering. The LLM synthesizes PyTorch convolutional networks, which are validated, evaluated on single-epoch accuracy, and filtered for structural redundancy using MinHash-Jaccard. Results show the LLM internalizes architectural priors, improving the valid generation rate and accuracy, and synthesizing novel, high-performing architectures not present in the original training data.
Demonstrates that LLMs can be fine-tuned using execution feedback to autonomously design novel and high-performing neural architectures, moving beyond memorization of existing designs.
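A minimal sketch of the MinHash-Jaccard novelty filter: reject a candidate whose estimated Jaccard similarity to any already-accepted architecture exceeds a threshold. The datasketch library, token-level shingling, and the 0.8 threshold are assumptions for illustration, not the paper's configuration.

```python
# MinHash-Jaccard novelty filtering over generated architecture code (assumed setup).
from datasketch import MinHash

def minhash_of(code: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in code.split():
        m.update(token.encode("utf-8"))
    return m

def filter_novel(candidates: list[str], threshold: float = 0.8) -> list[str]:
    accepted_code, accepted_sketches = [], []
    for code in candidates:
        sketch = minhash_of(code)
        # Keep only candidates sufficiently dissimilar to everything kept so far.
        if all(sketch.jaccard(prev) < threshold for prev in accepted_sketches):
            accepted_code.append(code)
            accepted_sketches.append(sketch)
    return accepted_code
```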
This study developed and evaluated an agentic AI tool leveraging LLMs and Retrieval-Augmented Generation (RAG) to automate full-text screening of publications for a systematic review of circulating biomarkers in heart failure. The tool decomposed inclusion/exclusion criteria into 136 tasks, assigned to individual LLM agents, and used a critique LLM for validation. Results showed the AI tool achieved a sensitivity of 91% and specificity of 53% in the validation phase, with greater inter-rater agreement (κ = 0.38) compared to human reviewers (κ = 0.23).
Demonstrates an agentic LLM-based AI tool's ability to automate full-text screening in systematic reviews, achieving high sensitivity and outperforming human reviewers in consistency.
The paper introduces Agent2World, a multi-agent framework for generating symbolic world models by leveraging web searching, model implementation, and adaptive unit testing. This framework grounds world model generation in multi-agent feedback, addressing the limitations of static validation methods. Fine-tuning the model with trajectories generated by the interactive testing environment leads to a substantial improvement in world-model generation, achieving a 30.95% relative gain.
Introduces a novel multi-agent framework, Agent2World, that leverages adaptive feedback from a testing team to improve the generation of symbolic world models.
The paper introduces PACIFIC, a framework for automatically generating benchmarks to evaluate LLMs' ability to follow instructions and dry-run code. PACIFIC generates benchmark variants with precise expected outputs, enabling reliable evaluation by comparing predicted and expected outputs, focusing on the LLM's intrinsic reasoning ability without relying on external tools or agentic behavior. Experiments using PACIFIC on state-of-the-art LLMs demonstrate its ability to create benchmarks of varying difficulty that effectively differentiate instruction-following and dry-running capabilities while mitigating training data contamination.
Introduces PACIFIC, a novel benchmark generation framework that isolates and evaluates LLMs' intrinsic instruction-following and code dry-running abilities, offering a scalable and contamination-resilient methodology.
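A minimal sketch of dry-run evaluation: ask the model to predict a program's printed output without executing it, then compare against the real output. The `ask_llm` hook is hypothetical, and PACIFIC's variant generation and difficulty controls are not shown.

```python
# Dry-run evaluation sketch: predicted output vs. executed output (ask_llm is assumed).
import io, contextlib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def actual_output(program: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(program, {})  # trusted benchmark code only
    return buf.getvalue().strip()

def dry_run_correct(program: str) -> bool:
    predicted = ask_llm(
        f"Without running it, what does this program print?\n{program}\n"
        "Reply with the exact output only."
    ).strip()
    return predicted == actual_output(program)
```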
The authors introduce RTLBench, a multi-dimensional benchmark suite for evaluating LLM-generated RTL code across syntax, functionality, lint compliance, readability, and style consistency, using 160 cases from textbooks and open-source projects. They evaluate 24 state-of-the-art LLMs, revealing that while syntax and functionality are reasonably addressed, engineering quality aspects are often lacking. To improve LLM-generated RTL, they propose Log2BetterRTL, a log-driven feedback system that leverages EDA tool diagnostics to iteratively refine the code, demonstrating significant improvements in various quality metrics.
Introduces RTLBench, a comprehensive benchmark suite with a multi-dimensional evaluation framework, to assess and improve the quality of LLM-generated RTL code beyond syntax and functionality.
The paper introduces MR-Size, an explainable effort estimator that predicts T-shirt sizes and interpolated day estimates for GitLab merge requests by computing a composite complexity score based on code diffs, file weights, contributor dynamics, and semantic signals. MR-Size achieves a Pearson correlation of 0.79 and a mean absolute error of 2.34 days across 150 merge requests, matching LOC baselines while providing per-file explanations. The method, datasets, evaluation plan, and reproducibility artifacts are described, with a benchmarking protocol comparing MR-Size against LOC baselines, COCOMO-style models, and learned regressors.
Introduces an explainable, repository-driven method (MR-Size) for estimating agile effort by mapping merge requests to T-shirt sizes using a composite complexity score derived from code diffs, file weights, contributor dynamics, and semantic contextual signals.
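A minimal sketch of mapping a merge request's composite complexity score to a T-shirt size and a day estimate. The feature weights, size thresholds, and day anchors below are illustrative assumptions, not MR-Size's calibrated values, and the interpolation between sizes is omitted.

```python
# Composite complexity score -> T-shirt size sketch (weights and thresholds assumed).
def complexity_score(lines_changed: int, files_touched: int,
                     weighted_file_risk: float, new_contributor: bool) -> float:
    score = 0.4 * lines_changed / 100 + 0.3 * files_touched
    score += 0.2 * weighted_file_risk + (1.0 if new_contributor else 0.0)
    return score

# (label, score upper bound, day anchor) buckets.
SIZES = [("XS", 1.0, 0.5), ("S", 3.0, 1.0), ("M", 6.0, 3.0),
         ("L", 10.0, 6.0), ("XL", float("inf"), 10.0)]

def tshirt_estimate(score: float) -> tuple[str, float]:
    for label, upper, days in SIZES:
        if score < upper:
            return label, days
    return "XL", 10.0

print(tshirt_estimate(complexity_score(250, 4, 1.5, False)))  # ('S', 1.0)
```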
This paper investigates the applicability of open-source LLM frameworks, including both large-scale and lightweight models, for automating penetration testing tasks relevant to commercial security assessments. The study identifies both the potential and limitations of these frameworks in addressing fundamental challenges in penetration testing. The authors propose a practical approach to overcome key limitations and demonstrate the potential of LLM-based frameworks in real-world penetration testing scenarios.
Demonstrates the practical application of open-source LLM frameworks for penetration testing, highlighting their capabilities and limitations, and proposes solutions to address identified challenges.
This paper benchmarks the performance of DeepSeek Coder and Meta-llama-3-70b-instruct in detecting SQL injection vulnerabilities using a labeled dataset of malicious and legitimate SQL queries. The evaluation focuses on Boolean-based attacks and measures precision, recall, F1-score, and accuracy. Meta-llama-3-70b-instruct achieved superior recall and overall accuracy (74.00%) compared to DeepSeek Coder (60.00%), suggesting it is better at detecting a wider range of malicious queries, though both models require further refinement for standalone security analysis.
Quantifies and compares the effectiveness of DeepSeek Coder and Meta-llama-3-70b-instruct in identifying SQL injection vulnerabilities, revealing the strengths and weaknesses of each model.
The paper introduces GrowthHacker, a benchmark and framework for optimizing off-policy evaluation (OPE) using code-modifying LLM agents, addressing the limitations of online A/B testing. They developed a two-agent framework within GrowthHacker that iteratively optimizes OPE code, evaluates the results, and initiates new optimization cycles using real-world datasets from Open Bandit Pipeline and Scope-RL. Experiments demonstrate that the two-agent framework achieves 100% reliability and a 106.7% average improvement in OPE performance, outperforming other LLM agent-based approaches.
Demonstrates the feasibility and effectiveness of using code-modifying LLM agents to automatically optimize off-policy evaluation, achieving significant performance improvements over baseline methods.
This paper introduces a post-tool execution reflection mechanism that leverages LLM-based reflection and domain-specific RAG to repair failed tool calls in agentic systems. The approach uses a combination of tool-specific documentation and troubleshooting documents to identify and correct both syntactic and semantic errors that are only apparent after the tool's response is analyzed. Experiments using the kubectl command-line tool for Kubernetes management demonstrate that the RAG-based reflection improves the execution pass rate by 55% and the correctness of answers to user queries by 36% on average, with troubleshooting documents outperforming official documentation.
Introduces a novel post-tool execution reflection component that combines LLM-based reflection with domain-specific RAG to improve the reliability and accuracy of tool calls in agentic systems.
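A minimal sketch of post-execution reflection for a kubectl call: run the command, and on failure retrieve troubleshooting passages and ask the LLM to propose a corrected command. The `retrieve_docs` and `ask_llm` functions are hypothetical hooks standing in for the paper's RAG index and reflection prompt.

```python
# Post-tool-execution reflection loop (retrieval and LLM hooks are assumed).
import subprocess, shlex

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def retrieve_docs(error_text: str, k: int = 3) -> list[str]:
    raise NotImplementedError("plug in your troubleshooting-doc retriever here")

def run_with_reflection(command: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Reflection step: ground the repair in retrieved troubleshooting docs.
        context = "\n---\n".join(retrieve_docs(result.stderr))
        command = ask_llm(
            f"The command `{command}` failed with:\n{result.stderr}\n"
            f"Relevant docs:\n{context}\nReply with a corrected command only."
        )
    raise RuntimeError("tool call could not be repaired")
```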
The paper introduces Blueprint2Code, a multi-agent framework designed to improve code generation by mimicking the human programming workflow through task comprehension, planning, implementation, and iterative refinement. This framework utilizes four interacting agents—Previewing, Blueprint, Coding, and Debugging—to address the limitations of LLMs in complex programming tasks requiring multi-step reasoning and reliable code generation. Experiments on HumanEval, MBPP, and APPS datasets demonstrate that Blueprint2Code achieves state-of-the-art pass@1 results, significantly outperforming existing methods, especially on extended and more challenging versions of the benchmarks.
Introduces a novel multi-agent framework, Blueprint2Code, that decomposes code generation into distinct stages handled by specialized agents to improve performance on complex programming tasks.

