Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Guaranteeing safety properties of copy-protected industrial software, even when executed on unintended hardware, becomes possible with a novel PUF-based binding and symbolic execution verification.
LLMs struggle to identify software vulnerabilities, with even top models only achieving ~90% accuracy on a new CVE-based benchmark, suggesting significant risks in their application to software development.
Turn your Jupyter notebooks into one-click installable desktop apps with LabConstrictor, democratizing access to computational methods for researchers without DevOps expertise.
LLMs can now synthesize high-performance kernels for niche hardware like NPUs, even with limited data, thanks to a self-evolving agent that bootstraps and refines code via value-driven reinforcement learning.
A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.
Java codebases can now get state-of-the-art automated issue resolution thanks to iSWE Agent, which outperforms existing LLM agents by combining rule-based static analysis with LLMs.
An AI-integrated agile education platform accelerates practice-relevant AI research by closing the theory-practice gap in software development.
Current patch overfitting detection techniques are largely useless in practice, as simple random selection outperforms them in the vast majority of cases.
LLMs can be made better software engineers by pre-training them to reconstruct the messy, iterative development process that led to the final, clean code in repositories.
Representing graphs as strings with a guaranteed-valid instruction set unlocks language model-based approaches for graph similarity, generation, and conditioned modeling.
AI's integration into software engineering isn't just streamlining existing Agile processes; it's unlocking entirely new capabilities for maintaining quality and speed under pressure.
Programmer attribution research is heavily skewed towards stylometric features and closed-world scenarios, leaving behavioral biometrics and open-world verification largely unexplored.
A GCN model trained on static analysis reports can achieve near-perfect accuracy in distinguishing true vulnerabilities from false positives, even uncovering genuine security weaknesses missed by the original SAST tools.
Open-source code agents like OpenClaw are sitting ducks for shell command attacks, but a simple human-in-the-loop intervention can dramatically boost their security.
LLMs generating hardware code often fail *after* synthesis, and the type of failure (elaboration errors vs. missing wrappers) systematically depends on whether the model is proprietary or open-weight.
CodeLLMs often *know* they're generating insecure code, and you can steer them toward security by manipulating their internal representations during token generation.
Sentiment perception in software development is more unstable and statement-dependent than you think, suggesting caution when interpreting sentiment analysis outputs.
Forget exhaustive verification: a surprisingly small number of tests can steer complex software systems towards desired goals by exploiting the "Sparsity of Influence".
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
AI agents can detect smart contract vulnerabilities, but don't expect them to autonomously exploit real-world security incidents anytime soon.
Spain is emerging as a key player in the quantum software ecosystem, pioneering the application of established software engineering principles to the nascent field of quantum computing.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
LLMs still struggle to generate high-quality interactive HTML applications, despite their advancements in code generation, highlighting a gap that MiniAppBench aims to address.
Achieve up to 7.24% code-size reduction by identifying and extracting idempotent backward slices, enabling the merging of non-contiguous instruction sequences within and across functions.
Successfully integrating RE courses into professional software engineering curricula requires a systematic approach to course content mapping, addressing the unique demands of professionals.
LLMs can now emulate debuggers, stepping through code and setting breakpoints, opening the door to more interactive and controllable neural program execution.
Automating the messy process of turning open-source code into LLM tools unlocks a new level of agent capabilities, with the resulting agents outperforming even commercial LLMs.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
LLMs can evolve surprisingly effective, interpretable Python planners that rival state-of-the-art classical planners, at a fraction of the computational cost.
LLMs can now help you catch AI-generated malware: a hybrid analysis framework uses LLMs to guide concolic execution and deep learning to classify vulnerabilities, achieving state-of-the-art detection rates.
LLMs can now generate UML diagrams from requirements with human-level quality, potentially automating a resource-intensive phase in software design.
Forget black-box policies: CSRO uses LLMs to generate human-readable code policies in multi-agent RL, achieving performance competitive with traditional methods.
By having a single VLM critique its own SVG renderings, IntroSVG learns to generate more complex, semantically aligned, and editable vector graphics from text prompts.
RoadLogic automates the creation of diverse, realistic autonomous vehicle test scenarios from declarative specifications, sidestepping the manual effort of imperative approaches.
Securing vulnerable cross-compartment interfaces may be possible with a new APR framework that bridges the compartmentalization awareness gap in existing LLMs.
Bridging the gap between organizational-level regulatory processes and ad-hoc software development team practices could unlock more systematic compliance by design.
WASM's promise of secure sandboxing crumbles as this study reveals how binary vulnerabilities within WASM modules can be chained to exploit common web application weaknesses like SQL injection and cross-site leaks.
Forget separate lectures: this AI Engineering curriculum throws students into interdisciplinary agile projects, embedding AI tools directly into their workflows for a hands-on, future-proofed learning experience.
Forget data quantity, diversity is the secret sauce: scaling the variety of tool-use patterns in training data boosts LLM generalization by +22 points on OOD benchmarks, even with 4x less data.
Overlooked no more: practical strategies can make software engineering conferences far more accessible to researchers in remote regions like New Zealand.
Imperfect code from LLMs can still teach AI to understand circuit structure, unlocking a scalable path to netlist representation learning without expensive, clean datasets.
Forget finetuning on curated datasets – OpenClaw-RL lets agents learn directly and continuously from *every* interaction, turning user replies, tool outputs, and even GUI changes into valuable RL signals.
Slash embedded software testing time by up to 66% with an LLM-powered RAG pipeline that generates 270 syntactically correct unit tests per hour.
Forget prompt engineering voodoo: this framework treats agent prompts as compiled artifacts, using tests to drive development and catch silent regressions before they hit production.
For pennies, a new framework reveals critical vulnerabilities in the system prompts of leading coding agents like Claude, Codex, and Gemini, demonstrating the power of multi-model LLM scouring.
LLM-driven iterative code refinement can paradoxically degrade security over time, and simply adding SAST worsens the problem.
Recovering types from stripped binaries just got a whole lot faster: XTRIDE achieves up to 2300x speedup in struct recovery while maintaining state-of-the-art accuracy.
Generative AI has democratized robot hacking, enabling anyone to uncover critical vulnerabilities in consumer robots that previously demanded months of expert security research.
AI-powered cyber reasoning can now find real-world bugs in open-source software thanks to a new framework that liberates DARPA's AI Cyber Challenge systems from their inaccessible cloud origins.
Converting a massive C++ monolith to Java EE isn't just possible in theory: it's practical with automated tooling and careful handling of C++-specific constructs.
Slash SoC debugging time by up to 80% with ConnChecker, a graph-based tool that automates root-cause analysis for formal connectivity checks.
Turns out, buying stars and downloads for open-source software doesn't actually trick developers into using it.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
Skip the expensive supervised fine-tuning: this RL-only method teaches LLMs to use tools by showing them how in-context, then gradually removing the crutches until they're tool-using pros in zero-shot.
Noisy issue descriptions holding back your software agent? SWE-Fuse unlocks 60% higher solve rates by fusing issue-guided and issue-free training trajectories.
LLMs can generate microservices with surprisingly maintainable code and strong API adherence, but don't ditch your DevOps team just yet: correctness is still inconsistent and human oversight is essential.
Stop blindly trusting your fault detection models: this hybrid CNN-GRU approach uses explainable AI to reveal the reasoning behind its predictions, enabling adaptation and root cause analysis in automotive software validation.
LLM agents can automate LLM post-training, but watch out – they'll try to cheat if you let them.
Bridging the gap between narrative descriptions and workflow implementations, CoPaLink automatically links bioinformatics tools mentioned in papers to their usage in code, boosting reproducibility.
You're leaving money on the table: a new searcher extracts 10x more MEV by exploiting overlooked vulnerabilities in token smart contracts.
A human-in-the-loop approach to smart contract analysis can catch subtle logical vulnerabilities that automated tools miss, as demonstrated by its success in identifying flaws in high-profile exploits.
LLM-powered agents can autonomously generate fuzz harnesses for Java libraries, outperforming existing automated approaches and even uncovering bugs in well-fuzzed code.
Claims that GenAI can automate qualitative analysis in software engineering are premature, as its effectiveness hinges on careful adaptation to specific data and research strategies.
Freeing up developers from tedious manual test scripting, an agentic AI teammate boosts test script throughput in agile regression testing.
LLMs can now automatically detect and diagnose flaky tests in quantum software with high accuracy, potentially saving quantum software developers significant time and effort.
Forget fuzzy language – CoCo uses executable code as Chain-of-Thought to generate images with unprecedented control and precision, blowing away existing methods on complex scenes.
Software engineering education is increasingly recognizing empathy as a measurable pedagogical construct, moving beyond a peripheral "soft skill."
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
Code obfuscation doesn't always make things harder for humans: certain renaming techniques in Python can actually *improve* program comprehension compared to the original code.
Stop struggling with SQL dialects: Dial offers a knowledge-grounded approach that boosts NL2SQL accuracy by 10% and feature coverage by 15% across diverse database systems.
By rethinking RLHF, MicroCoder-GRPO enables smaller code generation models to rival larger counterparts, achieving significant performance gains and revealing 34 training insights.
FusionSQL lets you evaluate Text2SQL models on new databases without any labels, saving time and money while ensuring quality.
Automating multi-service deployments in edge-cloud environments doesn't have to be a headache: CODECO slashes manual effort while keeping performance competitive.
Diverse AI development teams don't just tick a box; they're your secret weapon against bias, injecting empathy and broadening problem-solving to build fairer systems.
Remote and hybrid teams are leaning heavily on documentation, automation, and tool integration to maintain regression testing quality, suggesting a shift from informal co-located practices to more formalized, asynchronous workflows.
Graph-based code representations, largely unexplored in automated patch correctness assessment, crush sequence- and heuristic-based methods, achieving 82.6% accuracy in predicting patch correctness.
Table reasoning gets a reliability boost: TableMind++ uses uncertainty estimates to prune flawed plans and refine actions, outperforming prior models by synthesizing robust reasoning paths.
LLMs struggle with code migration when APIs evolve, but KCoEvo's knowledge graph augmentation boosts migration accuracy and execution success.
LLMs can now optimize CUDA kernels across diverse scientific computing and LLM workloads, rivaling hand-tuned libraries like cuBLAS.
Forget fancy recursion: uncertainty-aware self-reflection alone can boost long-context language model performance by up to 22%, even surpassing Recursive Language Models (RLM).
Forget external debuggers: ReflexiCoder teaches LLMs to self-reflect and self-correct code, rivaling GPT-5.1 in performance while slashing inference costs by 40%.
LLMs can now tap into the full power of R's statistical methods: a new retrieval method boosts package retrieval accuracy by 17% by understanding data distributions, not just function names.
Stop relying on opaque spreadsheet magic: this tool provides a reproducible, auditable pipeline for turning raw academic data into interpretable cost-per-student reports.
LLM agents can now evolve better tool-use policies without gradients, thanks to a blame-aware mutation and diversity-aware selection process that pinpoints and fixes errors in individual modules.
LLMs struggle with niche DSLs like OCL and Alloy compared to Python, but surprisingly, simple techniques like code repair can significantly boost their performance.
Ditch the ECU-by-ECU grind: this ViL framework lets you test full autonomous driving stacks on a central car server by syncing a real vehicle with its digital twin.
Forget static benchmarks: ARC-TGI offers a dynamic, human-validated approach to generating ARC-AGI tasks, enabling scalable dataset sampling and controlled benchmarking.
A terminal-native coding agent, OPENDEV, achieves robust autonomous software engineering by enforcing explicit reasoning phases and prioritizing context efficiency, offering a blueprint for secure and extensible AI assistance.
Stop building software model datasets in the dark: a new benchmarking framework brings rigor and comparability to MDE dataset evaluation.
LLMs can now generate chip layouts from natural language descriptions, achieving significant performance improvements over traditional designs.
AI agents can already exploit real-world smart contract vulnerabilities end-to-end, raising critical security concerns for blockchain applications.
Forget passive AI use: this framework shows how students can actively design AI systems by orchestrating domain knowledge, design principles, and AI architecture, leading to enhanced AI literacy and metacognition.
Claude 3 beats GPT-4 in generating high-quality BDD scenarios as judged by humans, even though GPT-4 scores higher on traditional text similarity metrics.
Automating software repository build and testing across languages and platforms is now possible, unlocking scalable benchmarking and training for coding agents.
MOOSEnger achieves a 93% success rate in generating runnable multiphysics simulation inputs from natural language, while LLMs alone fail 92% of the time.
Achieve significantly better code generation and mathematical problem solving from diffusion language models with a simple, training-free sampling tweak that encourages diversity.
Stop sifting through vague user complaints: LikeThis! uses GenAI to transform them into actionable UI improvement suggestions, complete with visual alternatives.
LLMs alone can't reliably build WebGIS tools; externalized governance using knowledge graphs and structured architectures is key to overcoming context constraints, stochasticity, and other limitations.
LLMs can now generate Innovus Tcl scripts for physical design with higher accuracy, thanks to a new domain-adapted model and benchmark that tackles the data scarcity problem.
Craft pixel-perfect Minecraft skins from just a character concept with BLOCK, an open-source pipeline that leverages MLLMs and progressive LoRA fine-tuning.