April 24 – May 1, 2026

Code Generation & Program Synthesis - Weekly Roundup

100 papers published across 3 labs.

Selected Labs publishing this week

Tsinghua AI (2), Amazon Science (1), Stanford HAI (1)

Top Papers

Apr 30, 2026

Michael Hanus +3·3w ago·also Kiel University

A Monadic Implementation of Functional Logic Programs

Functional logic programs can be efficiently implemented in purely functional languages like Haskell, achieving performance gains over existing Curry compilers by using a novel monadic interface with memoization.

Michael Hanus, Kai-Oliver Prott +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Natural Language Processing

May 1, 2026

Daniel Song +23·3w ago

Code World Model Preparedness Report

Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.

Daniel Song, Peter Ney, Cristina Menghini +21

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Indraneil Paul +3·3w ago

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.

Indraneil Paul, Goran Glavaš +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning

Massimo Rondelli +2·3w ago

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

LLMs can now generate 70% syntactically correct and geometrically consistent 3D objects from text, thanks to retrieval-augmented code synthesis.

Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval

Apr 30, 2026

Friedrich Schiller University·3w ago

Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis

Transformers struggle to extrapolate to syntactically novel programs in program synthesis, even with significant compute scaling, suggesting current approaches are bottlenecked by a lack of training diversity.

Henrik Voigt, Michael Habeck, Joachim Giesen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

All Papers (100)

May 1, 2026

Daniel Song +23·3w ago

Code World Model Preparedness Report

Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.

Daniel Song, Peter Ney, Cristina Menghini +21

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Indraneil Paul +3·3w ago

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning

Massimo Rondelli +2·3w ago

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

LLMs can now generate 70% syntactically correct and geometrically consistent 3D objects from text, thanks to retrieval-augmented code synthesis.

Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval

Apr 30, 2026

Friedrich Schiller University·3w ago

Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis

Henrik Voigt, Michael Habeck, Joachim Giesen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Zainab Rehan +7·3w ago

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

LLMs can synthesize formal safety rules from natural language goals, offering a path to more robust and verifiable AI systems in safety-critical domains.

Zainab Rehan, Christian Medeiros Adriano +5

Code Generation & Program Synthesis Constitutional AI & AI Ethics Reasoning & Chain-of-Thought

Barcelona Supercomputing Center·3w ago

RuC: HDL-Agnostic Rule Completion Benchmark Generation

LLMs struggle to complete RTL code, and their performance hinges on the grammatical structure of the missing code and the prompting strategy used.

Arnau Ayguadé Domingo, Miquel Alberti-Binimelis +7

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Yildiz Technical University·3w ago

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

LLMs can learn to safely leverage external memory for code debugging by explicitly modeling and penalizing the risk of false-positive memory injection.

Mehmet Iscan, M. Işcan

Code Generation & Program Synthesis Recommendation & Information Retrieval Tool Use & Agents

3w ago

Understanding Bugs in Template Engine-Based Applications: Symptoms, Root Causes, and Fix Patterns

Template engine bugs often manifest as silent failures with unexpected or blank outputs, and fixing them frequently requires changes to host-side logic, not just the template itself.

Kai Gao, Yu Sun, Chang-Ai Sun

Code Generation & Program Synthesis Natural Language Processing

3w ago·also UMass

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

LLMs still can't reliably reverse engineer stripped binaries, and REBench offers a standardized, fair-by-construction benchmark to finally measure progress.

Junsuh Won, Jun Yeon Won, Xin Jin +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

3w ago·also Milwaukee School of Engineering

Static Attribution of Android Residential Proxy Malware Using Graph Kernels

Achieve near-perfect attribution of Android residential proxy malware by fusing graph kernel features with binary capabilities, even amidst code reuse and obfuscation.

Peter Clark, Yong Guan +1

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness

Federal University of Ceará·3w ago·also Federal University of Bahia

An Empirical Evaluation of Code Smell Detection in Angular Applications

Angular apps are riddled with hidden design flaws: this study surfaces 11 common "code smells" and shows how to automatically sniff them out.

Maykon Nunes, Emanuel Coutinho +2

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis

3w ago·also Osaka, Wakayama University

A Longitudinal Analysis of Good First Issue Practices and Newcomer Pull Requests in Popular OSS Projects

Newcomers beware: the odds of your "good first issue" pull request getting merged have plummeted nearly 20% in the last year.

Hirotatsu Hoshikawa, Hidetake Tanaka, Kazumasa Shimari +3

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

Michael Hanus +3·3w ago·also Kiel University

A Monadic Implementation of Functional Logic Programs

Michael Hanus, Kai-Oliver Prott +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Natural Language Processing

Jisheng Zhao +4·3w ago

CuLifter: Lifting GPU Binaries to Typed IR

Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.

Jisheng Zhao, Huanzhi Pu, Shinnung Jeong +2

Code Generation & Program Synthesis Distributed Systems & Hardware Inference & Quantization

Qiyao Wang +7·3w ago

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.

Qiyao Wang, Haoran Hu, Longze Chen +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Multimodal Models +1

DeepWisdom·3w ago

ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Forget learning to answer – ANCORA shows language models can master verifiable reasoning by learning to *question* themselves.

Cheng Yang, Chengcao Yang, Jun Chen

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Training Efficiency & Optimization

Lei Li +8·3w ago·also Ickylin AI Team

ChipLingo: A Systematic Training Framework for Large Language Models in EDA

Domain-adapting LLMs for EDA requires explicit RAG scenario training to prevent performance degradation, and QA augmentation during corpus construction further boosts performance.

Lei Li, Xing Yu, Xingwen Yu +6

Code Generation & Program Synthesis Recommendation & Information Retrieval Training Efficiency & Optimization

Taslim Jamal Arif +2·3w ago

Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

Real-world Text-to-SQL systems can now be continuously evaluated and improved in production, even without access to database schemas or ground-truth queries.

Taslim Jamal Arif, Kuldeep Singh

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Ivan Bercovich +1·3w ago

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.

Ivan Bercovich

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

3w ago

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

Text-to-SQL models can get a 36% accuracy boost and run 2.2x faster by exploiting the predictable patterns in real-world query workloads.

Smit Jivani, Sarvam Maheshwari, Sunita Sarawagi

Code Generation & Program Synthesis Natural Language Processing

Shuo Jiang +1·3w ago

Design Structure Matrix Modularization with Large Language Models

Domain knowledge, usually helpful, can actually *hurt* LLMs tackling complex engineering design modularization, revealing a fundamental tension between semantic priors and structural optimization.

Shuo Jiang, Jianxi Luo

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

Guang Yang +3·3w ago

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

MLLMs can ace circuit-to-code generation by cheating with identifier semantics, even when the circuit diagram is blank.

Guang Yang, Xing Hu, Xiang Chen +1

Code Generation & Program Synthesis Computer Vision Multimodal Models

Jackson Vonderhorst +5·3w ago·also Notre Dame

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.

Jackson Vonderhorst, Kuangshi Ai, H. Miao +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Adam Ishay +1·3w ago

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs can achieve robust nonmonotonic reasoning across diverse tasks without task-specific engineering, simply by iteratively self-correcting based on feedback from an ASP solver.

Adam Ishay, Joohyung Lee

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Zhuoran Pan +4·3w ago

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.

Zhuoran Pan, Yue Li, Zhi Guan +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Zhongguancun Academy·3w ago

AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments

Automating the translation of economic intuitions into executable computational experiments is now possible, potentially accelerating the pace of economic research.

Jiaju Chen, Jinghua Piao +5

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

C. Meng +7·3w ago·also NYCU

HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

LLMs can now reliably generate IC verification testbenches, not by writing HDL directly, but by orchestrating a novel hybrid approach that combines LLM-driven planning with template-based HDL generation.

Chang-Chih Meng, Yu-Ren Lu +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

3w ago

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.

Jiasheng Zheng, Xin Zheng, Boxi Cao +9

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Wei Cheng +6·3w ago

To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing

LLMs can edit code 30% faster and cheaper without sacrificing accuracy, simply by learning to choose between generating full code and structure-aware diffs.

Wei Cheng, Yongchang Cao, Chen Shen +4

Code Generation & Program Synthesis Inference & Quantization Training Efficiency & Optimization

China Telecom Research Institute·3w ago

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.

Maofei Chen, Laifu Wang, Yue Qin +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Open-Source Models & Weights

Zi Li +6·3w ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

You can steal secrets from locally fine-tuned LLMs by backdooring their model code, even bypassing common defenses like differential privacy and code audits.

Zi Li, Tian Zhou, Tianyang Zhou +4

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Md. Faizul Ibne Amin +5·3w ago

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.

Md. Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning +1

3w ago·also Tsinghua AI, CAS, NJU, NTU

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Code dataset watermarking gets a stealthy upgrade: PuzzleMark hides watermarks in variable names based on code complexity, making them nearly undetectable while guaranteeing perfect verification.

Haocheng Huang, Yuchen Chen, Weisong Sun +6

Code Generation & Program Synthesis Data Curation & Synthetic Data

Pedro-Aarón Hernández-Ávalos +3·3w ago·also Tecnologico de Monterrey

Pragmos: A Process Agentic Modeling System

Forget end-to-end automation: Pragmos shows how LLMs can actually *improve* business process modeling by collaborating with humans in a structured, step-by-step workflow.

Pedro-Aarón Hernández-Ávalos, Luciano García-Bañuelos +1

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

Federal University of Ceará·3w ago·also Federal University of Alagoas, Federal University of Bahia

Beyond Code, We Are People: A Systematic Mapping of 25 Years of Literature on Soft Skills in Agile Development Teams

Turns out, even in the age of AI, good old-fashioned communication and teamwork are still the bedrock of successful agile software development.

Israely Lima, Lucas Moura Lourenço +7

Code Generation & Program Synthesis Natural Language Processing

Rochester Institute of Technology·3w ago

Unsafe and Unused? A History of Utility Code in Mature Open Source Projects

"Utility" code, intended to be broadly useful and reusable, is actually 2.75x more likely to be involved in a vulnerability than other code.

Brandon N. Keller, Kaitlin Yandik +5

Code Generation & Program Synthesis Open-Source Models & Weights

Amazon Science·3w ago

One Size Fits All? An Empirical Comparison of ADR Templates regarding Comprehension, Usability, and Ease of Adoption

Turns out, the best template for documenting architectural decisions depends on whether you value conciseness (Nygard) or structural detail (MADR).

Fernando Nogueira, Nabson Silva +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Natural Language Processing

IIIT Allahabad·3w ago·also IIIT Hyderabad, IIIT Manipur

Multifaceted Hero Developers and Bug-Fixing Outcomes Across Severity

Defining "hero developers" in open-source projects is more nuanced than previously thought: technical prowess doesn't guarantee social engagement, and vice versa, impacting bug-fixing success in surprising ways.

Amit Kumar, Mahen Gandhi, Meher Bhardwaj +2

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

Blekinge Institute of Technology·3w ago

GenAI in Software Engineering: The Role of Technology Acceptance Models

Applying traditional technology acceptance models like UTAUT to GenAI reveals critical gaps in our understanding of how software engineers perceive and adopt these transformative tools.

Oscar Johansson, Jürgen Börstler +2

Code Generation & Program Synthesis Constitutional AI & AI Ethics

3w ago·also SMU

Tail-aware N-version Machine Learning Models for Reliable API Recommendation

Mitigating long-tail distributions in code datasets boosts API recommendation reliability by up to 10% using an ensemble of models that strategically reject low-confidence predictions.

Aoi Matsuda, Fumio Machida, David Lo

Code Generation & Program Synthesis Recommendation & Information Retrieval

Apr 29, 2026

3w ago

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Automating CUTLASS kernel synthesis and auto-tuning lets you get 2.79x speedups on real models like MiniGPT just by having an LLM rewrite your PyTorch.

Sina Heidari, Dimitrios S. Nikolopoulos

Code Generation & Program Synthesis Tool Use & Agents Training Efficiency & Optimization

Gilberto Sussumu Hida +2·3w ago

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

LLMs don't automatically win at study screening for software engineering SLRs: their performance is highly variable, sensitive to input data, and not consistently better than classical models.

Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, Erika Yahata

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Ermanno Francesco Sannini +6·3w ago

Identifying and Characterizing Semantic Clones of Solidity Functions

LLMs can help find functionally identical smart contracts even when the original code lacks comments, opening the door to better vulnerability detection and code reuse.

Ermanno Francesco Sannini, Francesco Salzano, Simone Scalabrino +4

Code Generation & Program Synthesis

3w ago

Probabilistic Condition, Decision and Path Coverage of Circuit-based Quantum Programs

Quantum programs can achieve seemingly high structural coverage, yet this bears little relation to their actual fault detection capability, echoing a cautionary tale from classical software testing.

Daniel Fortunato, José Campos, Rui Abreu

Code Generation & Program Synthesis

University of Hildesheim·3w ago

Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.

Adam Krafczyk, Klaus Schmid

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Open-Source Models & Weights

Ericsson AB·3w ago·also KTH

Where did we fail? -- Reproducing build failures in embedded open source software

Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.

Han Fu, Andreas Ermedahl, Sigrid Eldh +3

Code Generation & Program Synthesis Distributed Systems & Hardware Open-Source Models & Weights

QUT·3w ago·also Edith Cowan University, Research Graduate School

eDySec: A Deep Learning-based Explainable Dynamic Analysis Framework for Detecting Malicious Packages in PyPI Ecosystem

You can slash false positives in PyPI malware detection by 82% while simultaneously reducing feature dimensionality by 50% using a carefully tuned deep learning approach.

Sk Tanzir Mehedi, Raja Jurdak, Chadni Islam +2

Code Generation & Program Synthesis Open-Source Models & Weights

Mississippi State University, Starkville·3w ago

Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches

Security testing is fragmented: program analysis and adaptive testing operate largely in isolation, missing opportunities to leverage structural insights for more effective vulnerability detection.

Michael Wienczkowski

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness

3w ago·also KCL, Universidad del Atlantico Medio

Hot Fixing in the Wild

AI agents and humans exhibit over 10 distinct repair behaviors when performing urgent hot fixes, suggesting opportunities for targeted human-automation collaboration.

Carol Hanna, Karine Even-Mendoza, W. B. Langdon +3

Code Generation & Program Synthesis Open-Source Models & Weights

Barcelona Supercomputing Center (BSC)·3w ago

A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.

Petter Sandås, Íñigo Aréjula-Aísa

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Fei Bai +15·3w ago·also IQuest Research, RUC

ClawGym: A Scalable Framework for Building Effective Claw Agents

Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.

Fei Bai, Huatong Song, Shuang Sun +13

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks +1

Rafael Mayo +1·3w ago

DMRlib: Easy-coding and Efficient Resource Management for Job Malleability

Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.

Rafael Mayo, Enrique S. Quintana-Ortí

Code Generation & Program Synthesis Distributed Systems & Hardware Training Efficiency & Optimization

3w ago·also Chongqing, SJTU

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.

Chaoxiang Xie, Yuling Shi, Wenhao Zeng +3

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

M. K. Khalidi Siam +7·3w ago

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.

M. K. Khalidi Siam, Md. Tausif-Ul-Islam, Md. Reshad Romim Khan +5

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

Frank Ginac·3w ago

Cognitive Atrophy and Systemic Collapse in AI-Dependent Software Engineering

Over-reliance on AI code generation isn't just making developers lazy, it's creating a dangerous "Epistemological Debt" that could trigger systemic software failures.

Frank Ginac

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Media University Stuttgart·3w ago

The Buy-or-Build Decision, Revisited: How Agentic AI Changes the Economics of Enterprise Software

The rise of agentic AI coding systems doesn't spell the end for SaaS, but it *does* fundamentally alter the economics of building in-house, creating a hybrid governance model that blends code ownership with dependence on external AI infrastructure.

David Klotz

Code Generation & Program Synthesis Tool Use & Agents

3w ago

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

Defend against hardware Trojans in LLM-generated RTL code by structurally and semantically verifying training data, without needing to alter the underlying LLM.

Mahshid Rezakhani, Nowfel Mashnoor, Kimia Azar +1

Code Generation & Program Synthesis Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Nyx Foundation·3w ago·also Aichi Prefectural Aichi High School of Technology, Kyoto

Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification

Code-level security audits miss vulnerabilities arising from specification requirements, but SPECA finds them by reasoning directly from natural language specs.

Masato Kamba, Hirotake Murakami, Akiyoshi Sannai

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

3w ago

VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection

Code stylometry, often overlooked, can significantly boost vulnerability detection, improving F1 scores by up to 48% on key benchmarks.

Chidera Biringa, Ajmal Abbas, Vishnu Selvaraj +1

Code Generation & Program Synthesis Multimodal Models

3w ago

An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code

LLMs fail to generate secure cryptographic code the vast majority of the time, with 57% of compiled samples containing exploitable vulnerabilities like nonce reuse.

Mohamed Elsayed, Kenneth Fulton, Jeong Yang

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

3w ago·also University of Plymouth

Understanding the Skills Gap between Higher Education Institutions and the Software Engineering Industry

UK computer science grads may be over-indexed on database management while woefully unprepared for the software design and planning skills that industry actually needs.

Huy Phan, Ievgeniia Kuzminykh, Bogdan Ghita

Code Generation & Program Synthesis Natural Language Processing

Bryn Mawr College·3w ago

On the Effectiveness of Modular Testing with EvoSuite

Strict modular testing in EvoSuite tanks coverage, but relaxing target method isolation and prioritizing relevant call chains can boost coverage by 15%.

Elizabeth Dinella

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

3w ago·also IMT School for Advanced Studies Lucca

Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems

Forget LLMs, simple process metrics like code age and developer activity are the real MVPs for predicting bugs that slip into production Python code.

Giuseppe De Rosa, Pietro Liguori

Code Generation & Program Synthesis

Marco Robol +1·3w ago

Self-Evolving Software Agents

Forget hand-coded goals: these agents rewrite their own code and redefine their objectives on the fly, powered by LLMs.

Marco Robol, Paolo Giorgini

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Department of Computer Science·3w ago·also Texas A&M

Now's the Time: Computer Science Must Evolve to Emphasize Software and Systems Engineering with Artificial Intelligence (AI)

CS education risks irrelevance if it continues to prioritize rote coding skills over the systems-level thinking needed to build and manage complex AI-driven systems.

Chandra N. Sekharan, George K. Thiruvathukal

Code Generation & Program Synthesis Constitutional AI & AI Ethics Natural Language Processing

Rabeya Khatun Muna +3·3w ago

CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows

Automated program repair still struggles in real-world CI environments, succeeding in less than 20% of cases, even with the best LLMs.

Rabeya Khatun Muna, Md Nakhla Rafi, Tse-Hsun +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

3w ago·also IMT School for Advanced Studies Lucca, Napoli

What Makes Software Bugs Escape Testing? Evidence from a Large-Scale Empirical Study

Post-release software bugs aren't just about code complexity; they're a symptom of code age, frequent modification, and high churn, demanding a shift in testing focus.

Domenico Cotroneo, Giuseppe De Rosa, Cristina Improta +1

Code Generation & Program Synthesis Open-Source Models & Weights

Halley Young +1·3w ago

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

Stop letting your research code, theory, and documentation drift apart: a new LM orchestration method keeps them synchronized, slashing error rates in a case study by over 50%.

Halley Young, Nikolaj Björner

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

University of Cagliari·3w ago·also UCL

Comparing Smart Contract Paradigms: A Preliminary Study of Security and Developer Experience

Resource-oriented smart contract languages like Move cut security code by 60%, suggesting a path to safer DeFi even if it means writing more code.

Matteo Vaccargiu, Andrea Pinna, Maria Ilaria Lunesu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Maynooth University·3w ago

Graph Construction and Matching for Imperative Programs using Neural and Structural Methods

Unlock verification artifact reuse across languages by representing programs as typed, attributed graphs that capture both structure and semantics.

Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis

University of Jyväskylä·3w ago·also Tampere

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

Enforcing classical test-driven development principles directly within prompt orchestration enables more reliable and reproducible code generation from LLMs.

Tarlan Hasanli, Shahbaz Siddeeq, Bishwash Khanal +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

3w ago

Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering

Agentic AI has exploded in software engineering, achieving a 40x performance leap on SWE-bench in just 18 months, signaling a fundamental shift from code generation to AI-driven delegated execution.

Happy Bhati

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

Virginia Commonwealth University·3w ago·also SMU, University of Salerno

LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

LLMs in software engineering are mostly used for automation, not decision support, and suffer from reproducibility issues, revealing a critical gap in human-centered integration and transparency.

Victoria Gomes, Delaney Selb, Fabio Palomba +1

Code Generation & Program Synthesis Natural Language Processing

3w ago·also Austrian Post

Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development

Forget hype, focus on human oversight: this study reveals practical, actionable recommendations for actually integrating LLMs into software development workflows responsibly.

Krishna Ronanki, Beatriz Cabrero-Daniel, Tomas Herda +3

Code Generation & Program Synthesis Constitutional AI & AI Ethics Natural Language Processing

3w ago

PICKLES: a Natural Language Framework for Requirement Specification and Model-Based Testing

Get significantly higher test coverage from your BDD scenarios by automatically translating them into formal models.

María Belén Rodríguez, Petra van den Bos

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

3w ago

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

LLMs can now generate more complete and up-to-date code documentation 3x faster while using 85% fewer tokens, thanks to a novel knowledge graph representation of code repositories.

Dong Xu, Mingwei Liu, Xiwen Wang +2

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Natural Language Processing

3w ago·also SMU

An Empirical Study of Speculative Decoding on Software Engineering Tasks

Smaller models get a bigger speed boost from Speculative Decoding on software engineering tasks, challenging the assumption that larger models always benefit more from inference acceleration techniques.

Yijia Li, Junkai Chen, Xing Hu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Inference & Quantization

Pforzheim University of Applied Sciences·3w ago

Asset Administration Shell-Based OCL Validation Framework for Model-Based System Engineering

Stop manually juggling MBSE models and OCL constraints: this framework uses Asset Administration Shells to automate validation and interpretation.

Om Parkash, Jannik Bauer, Vincent Schmitt +2

Code Generation & Program Synthesis Tool Use & Agents

Apr 28, 2026

Jinxiang Meng +24·3w ago

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Current AI models are surprisingly inept at real-world data visualization tasks, failing more than half the time on a new benchmark designed to mimic enterprise workflows.

Jinxiang Meng, Shao-Gang Huang, Shaoping Huang +22

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Zhiyuan Fan +16·3w ago·also Tencent AI

Toward Scalable Terminal Task Synthesis via Skill Graphs

SkillSynth's skill graph approach lets you explicitly control the diversity of execution trajectories during terminal task synthesis, leading to more effective agent training.

Zhiyuan Fan, Tinghao Yu, Yuanjun Cai +14

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents

Ajmain Inqiad Alam +4·3w ago

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Slash your LLM's carbon footprint by up to 81% without sacrificing performance using a compression pipeline inspired by carbon taxation.

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy +2

Code Generation & Program Synthesis Inference & Quantization Training Efficiency & Optimization

3w ago·also Tsinghua AI, CUHK, UChicago

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Decentralized debate among LLM agents doesn't just select the best solution for optimization modeling; it structurally enables agents to refine flawed candidates and even recover correct formulations through interaction.

Jianghao Lin, Zi Ling, Chenyu Zhou +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

3w ago·also Quantstamp

PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

Forget scaling laws: for code classification and vulnerability detection, the *right* code-specialized PLM matters more than GNN architecture or PLM size in PLM-GNN hybrids.

Mohamed Taoufik Kaouthar El Idrissi, Edward Zulkoski, Mohammad Hamdaqa

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Eval Frameworks & Benchmarks

ABB Robotics·3w ago·also Mälardalen University

Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Transformer-based language models don't always win: simpler, TF-IDF-based models surprisingly outperform them in fault localization using industrial bug reports.

Pernilla Hall, Anton Ununger, Riccardo Rubei +1

Code Generation & Program Synthesis Natural Language Processing Robotics & Embodied AI

Ben-Gurion University of the Negev·3w ago

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Multi-agent code editing with structured failure feedback boosts task success by 17%, suggesting a promising path to more reliable LLM-driven code manipulation.

Noam Tarshish, Nofar Selouk, Daniel Hodisan +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

CASABLANCA hotelsoftware GmbH·3w ago·also Diffblue Ltd, TU Munich, University of Innsbruck

RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

LLM-generated API tests can be *less* effective when refined against faulty code, especially when requirements are vague, suggesting that blindly incorporating SUT behavior isn't always beneficial.

Leon Kogler, Stefan Hangler, Maximilian Ehrhart +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

3w ago·also Brown, UT Austin

Assistants, Not Architects: The Role of LLMs in Networked Systems Design

LLMs might sound good at designing networked systems, but they're surprisingly bad at avoiding configurations that violate basic constraints, highlighting the need for structured reasoning frameworks like Kepler.

Pratyush Sahu, Rahul Bothra, Venkat Arun +3

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

3w ago·also Luxembourg Institute of Science and Technology, TJU

Learning Generalizable Multimodal Representations for Software Vulnerability Detection

Software vulnerability detection gets a serious upgrade: aligning code with developer comments boosts F1 scores by up to 27% compared to traditional code-only methods.

Zeming Dong, Yuejun Guo, Qiang Hu +2

Code Generation & Program Synthesis Multimodal Models Natural Language Processing

Loughborough University·3w ago

AI as Consumer and Participant: A Co-Design Agenda for MBSE Substrates and Methodology

Current MBSE models are failing to leverage the full potential of AI, demanding a fundamental shift towards co-designing models and methodologies that prioritize machine-queryability.

Siyuan Ji

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

3w ago·also Stanford HAI

Automated Adversarial Collaboration for Advancing Theory Building in the Cognitive Sciences

LLMs can now automatically design and execute experiments to resolve debates between cognitive science theories, even discovering the models and experiments themselves.

Suyog Chandramouli, George Kachergis, Akshay Jagadish

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Universidad Austral Rosario·3w ago

From CRUD to Autonomous Agents: Formal Validation and Zero-Trust Security for Semantic Gateways in AI-Native Enterprise Systems

Securing AI-native enterprise systems demands a shift from traditional software validation to dynamic formal verification of stochastic agent behavior, as demonstrated by a Semantic Gateway that uncovers 100% of unauthorized state transitions.

Ignacio Peyrano

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Quentin Vacher +3·3w ago

Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

Evolving interpretable control policies for multi-task robots is now possible: MATPG leverages genetic programming to create a single agent that masters multiple continuous control tasks.

Quentin Vacher, Nicolas Beuve, Mickaël Dardaillon +1

Code Generation & Program Synthesis Robotics & Embodied AI Tool Use & Agents

Shivam Rawat +1·3w ago

Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

Agentic AI systems can confidently generate plausible but wrong scientific results, even when given domain-specific context, highlighting a critical challenge for their integration into research workflows.

Shivam Rawat, Lucie Flek

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Hojae Han +3·3w ago

R$^3$-SQL: Ranking Reward and Resampling for Text-to-SQL

Text-to-SQL models can now achieve significantly higher accuracy by grouping and ranking SQL candidates based on execution results, then strategically resampling when the initial pool is lacking.

Hojae Han, Yeonseok Jeong, Zhewei Yao +1

Code Generation & Program Synthesis Natural Language Processing Recommendation & Information Retrieval

3w ago·also PKU, Shanghai Qiji Zhifeng Co

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Coding agents can now evolve their own harnesses to outperform human-designed ones, thanks to a novel observability-driven approach.

Jiahang Lin, Shichun Liu, Chengjun Pan +6

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

A. Kuznetsova·3w ago

Does This Even Matter in the Real World? Real World Problems in Foundational Theory Courses

Students already believe foundational theory is relevant to their careers, so adding real-world examples may not be the best way to increase student buy-in.

A. Kuznetsova

Code Generation & Program Synthesis Natural Language Processing

Free University of Bozen-Bolzano·3w ago·also Arizona, Nutrosal Inc.

Key Developer Roles and Organizational Coupling in Microservices: A Longitudinal Analysis

Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.

Xiaozhou Li, Nariman Mani, Jose Sosa Rodriguez +1

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

3w ago

From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions

Unlock expert developer reasoning: a new dataset distills complex GitHub issue discussions into structured trajectories, revealing the collaborative problem-solving process behind open-source software.

Nazia Shehnaz Joynab, Soneya Binta Hossain

Code Generation & Program Synthesis Data Curation & Synthetic Data Natural Language Processing

Jun Gao +11·3w ago

CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

LLMs can nail the final answer in code execution but still fail to reason about the steps to get there, exposing a critical flaw in current evaluation methods.

Jun Gao, Yun Peng, Qian Qiao +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

The Open University·3w ago·also INSEAD, Lancaster University, Manchester

Does social identity matter in software engineering? Assessing the case of research software engineers

RSEs aren't just coders; a strong collective identity shapes their professional wellbeing, revealing a crucial social dimension in software engineering.

Chukwudi Uwasomba, Melanie Langer, Helen Sharp +3

Code Generation & Program Synthesis Natural Language Processing
