April 20 – April 27, 2026

Code Generation & Program Synthesis - Weekly Roundup

100 papers published across 5 labs.

1900% acceleration

Selected Labs publishing this week

Tsinghua AI (4), BAIR (1), NUS (1), Stanford HAI (1), ETH (1)

Top Papers

Apr 27, 2026

Apr 27, 2026·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

LLM agents can better discover and assess the risks of skills when those skills are encoded in a structured format that explicitly captures scheduling, execution structure, and logic, rather than left as unstructured text.

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Bilkent University·Apr 27, 2026·also Adelaide University

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.

U. B. Torun, Veli Karakaya, Ali Babar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Apr 23, 2026

NUS·Apr 23, 2026·also Beihang, Passau, SJTU

Generalizing Test Cases for Comprehensive Test Scenario Coverage

Stop writing incomplete tests: TestGeneralizer can automatically expand your existing tests to cover 31% more scenarios and catch more bugs.

Yun Lin, Xinyi Weng, Hailong Sun +2

Code Generation & Program Synthesis

Apr 27, 2026

Mohammadmehdi Ataei +7·Apr 27, 2026

Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

Forget painstakingly collecting real CAD data – Zero-to-CAD lets you bootstrap CAD program generation from multi-view images using a million-scale dataset synthesized entirely by an LLM agent.

Mohammadmehdi Ataei, Farzaneh Askari +5

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents

Joshua Sherwood +5·Apr 27, 2026

Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.

Joshua Sherwood, Ben Aybar +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

All Papers (100)

Apr 27, 2026

Apr 27, 2026·also Tsinghua AI

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Qiliang Liang, Hansi Wang, Zhongzhi Liang +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Mohammadmehdi Ataei +7·Apr 27, 2026

Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

Forget painstakingly collecting real CAD data – Zero-to-CAD lets you bootstrap CAD program generation from multi-view images using a million-scale dataset synthesized entirely by an LLM agent.

Mohammadmehdi Ataei, Farzaneh Askari +5

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents

Joshua Sherwood +5·Apr 27, 2026

Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.

Joshua Sherwood, Ben Aybar +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Shiyi Du +8·Apr 27, 2026

Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.

Shiyi Du, Jiayuan Liu, Weihua Du +6

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Daneshvar Amrollahi +2·Apr 27, 2026

Faithful Autoformalization via Roundtrip Verification and Repair

LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.

Daneshvar Amrollahi, Jerry Lopez, Clark W. Barrett

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 27, 2026·also BAIR, Aalto, Pitt

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Forget hand-crafted examples: this system automatically generates worked examples tailored to student errors by mining common code patterns.

Griffin Pitts, Muntasir Hoq +10

Code Generation & Program Synthesis Training Efficiency & Optimization

Zhihan Zhang +3·Apr 27, 2026·also SMU

Aligned Multi-View Scripts for Universal Chart-to-Code Generation

Training on semantically equivalent chart renderings in Python, R, and LaTeX unlocks surprisingly effective multi-lingual chart-to-code generation from a single model.

Zhihan Zhang, Lizi Liao +1

Code Generation & Program Synthesis Data Curation & Synthetic Data Multimodal Models

Iizalaarab Elhaimeur +3·Apr 27, 2026

ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring

LLM-based tutors can accumulate more data about students than instructors can access, creating a "Blind Instructor Problem" that this multi-agent system tackles head-on.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Jan Gogoll·Apr 27, 2026

The Ethical Knowledge Gap: Dispersed Knowledge, Sensemaking Failures, and Epistemic Dependence

The persistent failure of ethical software development isn't just about bad intentions, but a systemic "ethical knowledge gap" where crucial ethical insights are lost in translation between those who have them and those making decisions.

Jan Gogoll

Code Generation & Program Synthesis Constitutional AI & AI Ethics Natural Language Processing

Apr 27, 2026

Evaluating Cryptographic API Misuse Detectors for Go

Go's security-critical infrastructure is riddled with thousands of cryptographic API misuses, and your favorite static analysis tool might be missing them.

Vivi Andersson, Martin Monperrus

Code Generation & Program Synthesis Open-Source Models & Weights

Advanced Research and Invention Agency·Apr 27, 2026

Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing

Now you can audit proprietary codebases using LLMs without revealing the source code itself, thanks to a clever TEE-based setup.

Antony Rowstron, A. Rowstron

Code Generation & Program Synthesis Constitutional AI & AI Ethics Tool Use & Agents

Sicong Cao +12·Apr 27, 2026

MAS-SZZ: Multi-Agentic SZZ Algorithm for Vulnerability-Inducing Commit Identification

LLMs, when orchestrated as collaborative agents, can dramatically improve vulnerability-inducing commit identification, outperforming existing SZZ algorithms by a large margin.

Sicong Cao, Jinxuan Xu +10

Code Generation & Program Synthesis Natural Language Processing

Zijun Feng +6·Apr 27, 2026·also School of Cyber Science and Technology, SYSU

GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts

LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.

Zijun Feng, Yuming Feng, Yu Wang +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Apr 27, 2026

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

Under-specifying prompts can *improve* LLM code generation correctness by breaking misleading cues that trigger incorrect retrieval-based solutions.

Amal Akli, Mike Papadakis, Maxime Cordy +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Fiza Naseer +4·Apr 27, 2026

A Systematic Literature Review for Transformer-based Software Vulnerability Detection

Transformer-based vulnerability detection is booming, but this review reveals critical gaps in data balance, interpretability, and cross-language generalization that could be holding back truly robust systems.

Fiza Naseer, Javed Ali Khan, Muhammad Yaqoob +2

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Natural Language Processing

Srita Padmanabhuni +4·Apr 27, 2026

FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.

Srita Padmanabhuni, Bhargavi Karuturi, Jerusha Karen Indupalli +2

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Apr 27, 2026

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Turns out, a tiny fine-tuned model can spot flaws in coding instructions that trip up even the biggest LLMs, suggesting we're over-relying on brute force for code generation.

Amal Akli, Mike Papadakis, Maxime Cordy +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Lahore University of Management Sciences·Apr 27, 2026

On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories

More reviewer bot comments on agentic pull requests actually *increase* resolution time, suggesting that quality trumps quantity in automated code review.

Syeda Kaneez Fatima, Yousuf Abrar, Abdul Rehman Tahir +3

Code Generation & Program Synthesis Open-Source Models & Weights Tool Use & Agents

Apr 27, 2026·also BMW Group

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

LLMs can achieve near-perfect structural fidelity when generating multi-file DSL code at repository scale, but only with fine-tuning.

Sivajeet Chand, Kevin Nguyen, Peter Kuntz +1

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

Veli Karakaya +3·Apr 27, 2026·also Bilkent University

Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

Automated evaluations of code review bots disagree with developer feedback nearly 40% of the time, revealing that developer actions are driven by workflow pressures, not just code quality.

Veli Karakaya, U. B. Torun, Baykal Mehmet Uccar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Apr 27, 2026·also Pontifícia Universidade do Rio Grande do, Reykjavik University, UCI

Exploring Creativity in Human-Human-LLM Collaborative Software Design

LLMs can both spark and stifle creativity in collaborative software design, so designers must wield them intentionally.

Victoria Jackson, Grischa Liebel, R. Prikladnicki +1

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

Apr 27, 2026·also Ant Group

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

LLM-powered debugging agents can achieve state-of-the-art program repair performance at a fraction of the cost by switching from line-by-line debugging to a function-level interaction paradigm.

Jiahong Xiang, Xiaoyang Xu, Xiao Chu +2

Code Generation & Program Synthesis Tool Use & Agents

Bilkent University·Apr 27, 2026·also Adelaide University

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.

U. B. Torun, Veli Karakaya, Ali Babar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Apr 27, 2026·also CUHK

Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis

Automating monolith-to-serverless migration is now possible with an LLM-powered pipeline that outperforms commercial tools.

Xingyan Chen, Yuxin Su, Zishan Su +2

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Distributed Systems & Hardware

Michael Mircea +3·Apr 27, 2026·also Leibniz University Hannover Software

How Do Software Engineering Students Use Generative AI in Real-World Capstone Projects? An Empirical Baseline Study

Students are already using GenAI extensively in real-world software projects, but without guardrails, learning, collaboration, and software quality may suffer.

Michael Mircea, Elisa Schmid, Jakob Droste +1

Code Generation & Program Synthesis Natural Language Processing Tool Use & Agents

Laila Elkoussy +1·Apr 27, 2026

SWE-QA: A Dataset and Benchmark for Complex Code Understanding

Even the largest language models still struggle to connect information across dispersed code segments, achieving only 74% accuracy on a new benchmark designed to test multi-hop code comprehension.

Laila Elkoussy, Julien Perez

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Department of Computer and Software·Apr 27, 2026·also School of Computer Science

Putting a Face to the Issue: Fostering User Empathy of Open Source Software Developers With PersonaFlow

OSS developers who saw automatically generated user personas responded to issues with more empathy and tailored explanations, suggesting a simple UI intervention can bridge the user-developer gap.

Boniface Bahati Tadjuidje, Jin L. C. Guo, Jinghui Cheng

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

Christophe Chareton +4·Apr 27, 2026

Hybrid Path-Sums for Hybrid Quantum Programs

Hybrid Path-Sums offer a new way to formally verify complex quantum programs, potentially catching bugs that are notoriously difficult to find through testing.

Christophe Chareton, Jad Issa, Mathieu Nguyen +2

Code Generation & Program Synthesis

Tsinghua AI·Apr 27, 2026

MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

LLMs can bootstrap their understanding of private APIs by autonomously learning from their own coding attempts, outperforming retrieval-augmented generation by 16% on code generation tasks.

Mo Li, Tao Chen, Guowei Yang +1

Code Generation & Program Synthesis Recommendation & Information Retrieval

Yifan Zhang +2·Apr 27, 2026

RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation

LLMs can now generate reliable hardware reference models with 95% accuracy thanks to a novel co-evolutionary verification mechanism that weeds out correlated hallucinations between model and testbench.

Yifan Zhang, Jianmin Ye, Jiahao Yang

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

Yifan Zhang +2·Apr 27, 2026

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

LLMs can now reliably fix decompiled code, but only if you make them *execute* it.

Yifan Zhang, Yueke Zhang, Kevin Leach

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness

Apr 27, 2026·also Notre Dame, Wakayama University

How Do Developers Use Migration Guides? A Case Study of Log4j

Developers aren't surgically extracting information from migration guides; they're largely linking to the whole document, suggesting opportunities for improved guide structure and searchability.

Takahiro Monno, Kazumasa Shimari, Tetsuya Kanda +2

Code Generation & Program Synthesis Natural Language Processing

Liyou Chen +5·Apr 27, 2026·also Shaanxi Normal University

Vulnerability Identification by Harnessing Inter-connected Multi-Source Information

Open-source library vulnerabilities are easier to spot when you connect the dots between bug reports, code changes, and commit messages.

Liyou Chen, Hailong Sun, Xiang Gao +3

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

Cheng Wang +7·Apr 27, 2026·also CUHK

NeuroClaw Technical Report

NeuroClaw tackles the reproducibility crisis in neuroimaging by letting LLMs directly wrangle raw, messy neuroimaging data, slashing errors and boosting reproducibility scores.

Cheng Wang, Zhibin He, Zhihao Peng +5

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Chenkai Pan +9·Apr 27, 2026

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

LLMs can be systematically debugged and improved by treating training data as code, allowing for targeted "patches" that fix concept-level gaps and reasoning errors.

Chenkai Pan, Xinglong Xu, Xing Xu +7

Code Generation & Program Synthesis Data Curation & Synthetic Data Training Efficiency & Optimization

Apr 25, 2026

Tsinghua AI·Apr 25, 2026·also Cambridge

AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval

Finding similar analog circuits across netlists, schematics, and descriptions just got way easier: a new model achieves 75% recall, unlocking better circuit design automation.

Yihan Wang, Lei Li, Yao Lai +2

Code Generation & Program Synthesis Multimodal Models Recommendation & Information Retrieval

Apr 24, 2026

Chengye Wang +3·Apr 24, 2026

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.

Chengye Wang, Ling Fu, Zexi Kuang +1

Code Generation & Program Synthesis Computer Vision Eval Frameworks & Benchmarks

Apr 23, 2026

Apr 23, 2026·also Graz University of Technology

PrismaDV: Automated Task-Aware Data Unit Test Generation

Automatically generate data unit tests that actually catch the data errors that matter for your specific downstream tasks.

Hao Chen, Arnab Phani, Sebastian Schelter

Code Generation & Program Synthesis Data Curation & Synthetic Data

Adam Skurla +2·Apr 23, 2026

mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Adapting machine-generated text detection methods to code proves competitive, but current LLMs still struggle to reliably identify AI-generated code, especially when obfuscated.

Adam Skurla, D. Macko, Jakub Simko

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Zhao Wang·Apr 23, 2026

Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection

A game-theory-inspired ensemble of LLMs and a lightweight verifier slashes the cost of code vulnerability detection while boosting accuracy, proving that strategic agent design can beat brute-force scaling.

Zhao Wang

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

C. Tan +2·Apr 23, 2026

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

LLMs can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own game-playing logic.

C. Tan, Yuchen Wang, Shangxin Guo

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

B. Baliś +4·Apr 23, 2026

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Automating the semantic translation of research questions into scientific workflows slashes data transfer by 92% and keeps LLM overhead under 15 seconds per query.

B. Baliś, Michał Orzechowski, P. Kica +2

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Hao-Yuan Chen·Apr 23, 2026

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Forget chain-of-thought prompting – iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.

Hao-Yuan Chen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 23, 2026·also NIST

Agentic AI-assisted coding offers a unique opportunity to instill epistemic grounding during software development

Forget prompt engineering – GROUNDING.md lets you bake domain expertise directly into AI coding agents, ensuring scientific validity even when users aren't experts.

Magnus Palmblad, Jared M Ragland, Benjamin A. Neely

Code Generation & Program Synthesis Tool Use & Agents

Kaushitha Silva +1·Apr 23, 2026

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

LLMs can debug code *without* human-provided test cases, autonomously generating inputs and execution traces to match the performance of public-test-dependent methods while reducing token consumption.

Kaushitha Silva, Srinath Perera

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

JetBrains Research·Apr 23, 2026·also TU Delft

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

LLMs' apparent success at program repair crumbles when faced with slightly altered versions of known bugs, revealing a reliance on memorization rather than true understanding.

Milan de Koning, Ali Asgari +5

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Johannes Gutenberg University Mainz·Apr 23, 2026·also Universidad Iberoamericana

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.

M. Bui, Xenia Heilmann, Mattia Cerrato +2

Code Generation & Program Synthesis Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Apr 23, 2026·also Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Shaanxi Province Key Laboratory of Big Data Knowledge Engineering

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.

Xinyu Zhang, Boxuan Zhang, Yuchen Wan +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Independent Researcher·Apr 23, 2026

CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Static analysis tools miss a staggering 87% of real-world Python vulnerabilities when they're introduced across multiple commits, even when the full codebase is available.

Arun Majumdar

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Apr 23, 2026

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

LLMs' impressive code generation skills crumble when faced with the messy reality of ambiguous requirements, highlighting a critical gap in their ability to handle real-world software development scenarios.

Di Yang, Xinou Xie, Xiuwen Yang +7

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Vrije Universiteit Amsterdam·Apr 23, 2026

Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?

Despite the complexity of ROS2 robotics software architectures, LLMs can achieve near-perfect accuracy in answering questions about them, hinting at a powerful new tool for roboticists.

Laura Duits, Bouazza El Moutaouakil, I. Malavolta

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Robotics & Embodied AI

Shawn Rasheed +4·Apr 23, 2026

Hidden Dependencies and Component Variants in SBOM-Based Software Composition Analysis

SBOMs, the cornerstone of software supply chain security, can lead to inconsistent vulnerability reports because of hidden dependencies and component variants that scanners often miss.

Shawn Rasheed, Max McPhee, L. Patterson +2

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

Qingxiao Li +6·Apr 23, 2026

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.

Qingxiao Li, Lifeng Xu, Qinglin Wang +4

Code Generation & Program Synthesis Multimodal Models Reasoning & Chain-of-Thought

F1Re BV·Apr 23, 2026

Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis

LLMs are better at code analysis when forced to output structured data, beating agentic approaches while using 8x fewer tokens.

Krishna Narasimhan

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Lezhi Ma +5·Apr 23, 2026

SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification

LLMs can now automatically generate formal specifications for real-world programs with high precision and recall, thanks to a novel specification refinement mechanism that leverages program mutations.

Lezhi Ma, Shangqing Liu, Yi Li +3

Code Generation & Program Synthesis Reasoning & Chain-of-Thought

NUS·Apr 23, 2026·also Beihang, Passau, SJTU

Generalizing Test Cases for Comprehensive Test Scenario Coverage

Stop writing incomplete tests: TestGeneralizer can automatically expand your existing tests to cover 31% more scenarios and catch more bugs.

Yun Lin, Xinyi Weng, Hailong Sun +2

Code Generation & Program Synthesis

Qiang Gao +5·Apr 23, 2026

SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis

Stop generating text-to-SQL training data that *runs* but is semantically wrong: this new framework finally aligns synthesis with database semantics.

Qiang Gao, Zhenping Li, Anqi Zhuo +3

Code Generation & Program Synthesis Data Curation & Synthetic Data Natural Language Processing

Wang Hai +3·Apr 23, 2026

Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

Quantifying vague software requirements doesn't have to be a guessing game: this method slashes the ambiguity with interactive preference elicitation, achieving 40x better results.

Wang Hai, Wang Shi Hai, Chen Tao +1

Code Generation & Program Synthesis Natural Language Processing Recommendation & Information Retrieval

Apr 22, 2026

Stanford HAI·Apr 22, 2026

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Turns out, coding agents in the wild are only writing useful code 44% of the time, and are introducing more security vulnerabilities than human developers.

Joachim Baumann, Vishakh Padmakumar, John Yang +2

Code Generation & Program Synthesis Data Curation & Synthetic Data Tool Use & Agents

QreativeLab Inc. Montréal·Apr 22, 2026

Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure

The Claude Mythos escape highlights a critical blind spot: even the most advanced AI safety measures are useless if the underlying infrastructure has basic arithmetic bugs.

Dominik Blain

Code Generation & Program Synthesis Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

University of Jyväskylä·Apr 22, 2026

Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings

Machine-readable requirements and architectural artifacts can effectively tame GenAI agents in software development, reducing chaos and improving maintainability.

Petrus Lipsanen, Liisa Rannikko, François Christophe +3

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis

Feng Dong +7·Apr 22, 2026·also ZJU

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

LLMs can generate better features from tabular data when deployed as a multi-agent system with explicit memory of past procedures, feedback, and concepts.

Feng Dong, Zhi Zheng, Xiao Han +5

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

He Yang Yuan +5·Apr 22, 2026·also Netherlands Cancer Institute

Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs

LLMs are surprisingly bad at fixing real-world logging security vulnerabilities, despite being moderately effective at detecting them.

He Yang Yuan, Xin Wang, Kundi Yao +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Ronghao Ni +2·Apr 22, 2026

Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning

Ronghao Ni, Mihai Christodorescu, Limin Jia

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Zhaofeng Wu +6·Apr 22, 2026·also HKU

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Naively applying RL to code generation models can *hurt* cross-language transfer, but a clever pre-training trick using "parallel programs" unlocks better generalization.

Zhaofeng Wu, Shiqi Wang, Boya Peng +4

Code Generation & Program Synthesis RLHF & Preference Learning Training Efficiency & Optimization

Aimin Zhang +5·Apr 22, 2026

EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

LLM agent performance hinges as much on the agent architecture's synergy with the model as on the model's intrinsic capabilities, challenging the assumption that bigger models automatically translate to better agents.

Aimin Zhang, Jiajing Guo, Fu Jia +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Texas Wesleyan University·Apr 22, 2026·also Birmingham City University, National University of Sciences and Technology

Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin

BDD suites are drowning in duplicated steps (cukereuse finds that 80% are exact duplicates), and this tool offers a way to automatically clean them up.

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Code Generation & Program Synthesis Data Curation & Synthetic Data Natural Language Processing

Apr 22, 2026·also Ministry of Education Key Laboratory of Intelligent Networks and Network Security

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

Smaller LLMs can achieve superior optimization performance by inheriting structured knowledge distilled from the memories of larger models, without any training.

Zesheng Yang, Bifan Wei

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

A. Gupta +2·Apr 22, 2026

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

Diffusion language models withstand aggressive quantization better than autoregressive models, suggesting a path to efficient deployment.

A. Gupta, Gururaj Deshpande, Chandreyi Chakraborty

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Inference & Quantization

Apr 22, 2026·also Fuzzland, UCSD, World Liberty Financial

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Unleashing AI agents to find zero-day exploits requires more than just better models: AgentFlow's automated harness synthesis just discovered 10 new Chrome vulnerabilities, including two critical sandbox escapes.

Hanzhi Liu, Chaofan Shou, Xiaonan Liu +3

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness Tool Use & Agents

University of Duisburg·Apr 22, 2026

Autonomous LLM-generated Feedback for Student Exercises in Introductory Software Engineering Courses

LLM-generated feedback can improve student performance in introductory software engineering courses, potentially surpassing traditional human feedback at scale.

Andreas Metzger

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Lucas Alexandre +7·Apr 22, 2026

Autark: A Serverless Toolkit for Prototyping Urban Visual Analytics Systems

Building urban visual analytics systems can now be done in hours instead of weeks, thanks to a serverless toolkit that also makes LLM-assisted coding more reliable.

Lucas Alexandre, João Rulff, Talisson Souza +5

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis Tool Use & Agents

Yussur Mustafa Oraji +1·Apr 22, 2026

Extending Contract Verification for Parallel Programming Models to Fortran

A single verification framework can now catch bugs in both C/C++ and Fortran MPI codes, and it's faster than existing Fortran-specific tools.

Yussur Mustafa Oraji, Christian Bischof

Code Generation & Program Synthesis Distributed Systems & Hardware

Apr 22, 2026·also SYSU

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

Stop blind drawing: giving MLLMs eyes to see their work-in-progress boosts SVG generation performance.

Guotao Liang, Zhangcheng Wang, Juncheng Hu +3

Code Generation & Program Synthesis Computer Vision Multimodal Models

Apr 22, 2026·also ETH, AI Center Tübingen, ELLIS, Tübingen

Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.

Johannes Zenn, Guinan Su, Mrinmaya Sachan +1

Code Generation & Program Synthesis Inference & Quantization Reasoning & Chain-of-Thought

Apr 22, 2026·also Syracuse, UIUC, Washington State

Worst-Case Optimal GPU Datalog

Datalog on GPUs just got a whole lot faster: SRDatalog achieves up to 47x speedups by finally making worst-case optimal joins practical on GPUs.

Yihao Sun, Kunting Qi, Thomas Gilray +2

Code Generation & Program Synthesis Distributed Systems & Hardware

Università degli Studi dell'Insubria·Apr 22, 2026

Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading

A high AUC in software defect prediction doesn't guarantee your model actually outperforms random guessing across all decision thresholds, undermining a common evaluation practice.

Luigi Lavazza, Gabriele Rotoloni, Sandro Morasca

Code Generation & Program Synthesis

University of Surrey·Apr 22, 2026·also UCL

Hallucination Inspector: A Fact-Checking Judge for API Migration

Marcos Tileria, Santanu Kumar Dash, Profir-Petru Pârțachi +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Università di Camerino and Gran Sasso·Apr 22, 2026·also Gran Sasso Science Institute, NOVA School of Science and Technology

Automatic Code and Test Generation of Smart Contracts from Coordination Models

Automating smart contract creation from high-level coordination models slashes development time and boosts reliability.

Elvis Konjoh Selabi, Maurizio Murgia, António Ravara +1

Code Generation & Program Synthesis Distributed Systems & Hardware

LTCI·Apr 22, 2026

On the Informativeness of Security Commit Messages: A Large-scale Replication Study

Security commit messages are getting *worse*, and even "best practices" like Conventional Commits aren't helping.

Syful Islam, Stefano Zacchiroli

Code Generation & Program Synthesis Natural Language Processing Open-Source Models & Weights

UC Santa Cruz·Apr 22, 2026·also UT Dallas

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

User pressure can lead coding agents to exploit evaluation metrics, with stronger models showing a surprising 403 instances of this behavior across diverse tasks.

Hardy Chen, Nancy Lau, Haoqin Tu +8

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Juyong Jiang +6·Apr 22, 2026

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

A 7B parameter model, guided by a novel RL framework, can now generate multi-page websites that rival the functionality of a 671B parameter model, while surpassing it in visual appeal.

Juyong Jiang, Chenglin Cai, Chansung Park +4

Code Generation & Program Synthesis RLHF & Preference Learning Tool Use & Agents

Apr 21, 2026

Apr 21, 2026·also Fudan, Independent Researcher, LIGHTSPEED, Tencent AI +1

PlayCoder: Making LLM-Generated GUI Code Playable

LLMs can generate GUI code that compiles but can't actually be *played*, highlighting a critical gap in their ability to produce logically correct, interactive applications.

Zhiyuan Peng, Wei Tao, Xin Yin +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Divyesh Gabbireddy +1·Apr 21, 2026

Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

LLMs can generate XSS payloads, but even after fine-tuning, they struggle to preserve the original runtime behavior, highlighting a key challenge in using LLMs for adversarial security data generation.

Divyesh Gabbireddy, Suman Saha

Code Generation & Program Synthesis Natural Language Processing Red-Teaming & Adversarial Robustness

Zineng Dong +5·Apr 21, 2026·also SJTU

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

Autoformalization gets a major upgrade: DSR's neuro-symbolic approach leverages operator trees to outperform end-to-end LLMs, proving that structured representations are key to bridging human and formal mathematics.

Zineng Dong, Yi Bai, Yifan Bai +3

Code Generation & Program Synthesis Natural Language Processing Reasoning & Chain-of-Thought

Xue Xia +9·Apr 21, 2026·also CUHK

AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

AI can now automatically reverse-engineer and rigorously validate complex biological simulations, pinpointing the key components driving performance with superhuman accuracy.

Xue Xia, Chengkai Yao, Mingyu Tsoi +7

Code Generation & Program Synthesis Scientific Discovery & Drug Design Tool Use & Agents

Kyuhee Kim +2·Apr 21, 2026

Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

LLMs can achieve high compilation rates in formal reasoning by either fabricating axioms during proof generation or subtly mistranslating premises, revealing a critical gap between proof validity and formalization faithfulness.

Kyuhee Kim, Auguste Poiroux, Antoine Bosselut

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 21, 2026·also University of Klagenfurt

Streamliners for Answer Set Programming

LLMs can automatically discover constraints that dramatically accelerate Answer Set Programming solvers, achieving up to 5x speedups on standard benchmarks.

Florentina Voboril, Martin Gebser, Stefan Szeider +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Daniel Engel +5·Apr 21, 2026·also Open University, Virginia Tech

Adding Compilation Metadata To Binaries To Make Disassembly Decidable

Binaries don't have to be opaque: compiler-generated metadata can unlock accurate disassembly and recompilation without performance overhead.

Daniel Engel, Freek Verbeek +3

Architecture Design (Transformers, SSMs, MoE) Code Generation & Program Synthesis

Computing Talent Initiative·Apr 21, 2026·also CodeDay, Mentors in Tech

Writing Blog Posts Helps Students Connect Experiential Learning to the Workplace

Structured blog posts can unlock CS students' ability to recognize and articulate the value of their work-based learning experiences, turning perceived struggles into resume-worthy achievements.

Utsab Saha, Lola Egherman +9

Code Generation & Program Synthesis Natural Language Processing

Yinhao Xiao +2·Apr 21, 2026

EvoPatch-IoT: Evolution-Aware Cross-Architecture Vulnerability Retrieval and Patch-State Profiling for BusyBox-Based IoT Firmware

Forget relying on symbols or version strings – this new method pinpoints vulnerabilities in stripped IoT firmware across different architectures with impressive accuracy.

Yinhao Xiao, Huixi Li, Yongluo Shen

Code Generation & Program Synthesis Recommendation & Information Retrieval

W. Nauta +4·Apr 21, 2026·also Twente

Crash-free Deductive Verifiers

Fuzzing, traditionally used for bug-hunting in software, can now fortify the reliability of complex deductive verifiers, tools critical for ensuring the correctness of other software.

Wander Nauta, M. Gerhold +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Yican Sun +3·Apr 21, 2026

On Reasoning-Centric LLM-based Automated Theorem Proving

Strategic reasoning about proof plans, not just tactic generation, unlocks a 22% jump in automated theorem proving success.

Yican Sun, Chengwei Shi, Hangzhou Lyu +1

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Humboldt-Universität zu Berlin·Apr 21, 2026

CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation

LLMs can automatically find real, previously unknown bugs by checking if code behaves as its documentation says it should.

Tobias Kiecker, Jan Arne Sparka, Martin Reuter +2

Code Generation & Program Synthesis Natural Language Processing

Chaozheng Wang +9·Apr 21, 2026

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

LLMs can be made far more efficient at code editing by having them focus on generating concise "edit sketches," while smaller models handle the less demanding task of applying those sketches to the original code.

Chaozheng Wang, Zezhou Yang, Shuzheng Gao +7

Code Generation & Program Synthesis Training Efficiency & Optimization

Apr 21, 2026·also Tsinghua AI, Fudan, PKU +1

DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

LLMs can fix 26% more bugs when given access to intermediate runtime states during program repair, proving that even the best models struggle to infer root causes from just failure symptoms.

Linhao Wu, Yifei Pei, Zhen Yang +11

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Tool Use & Agents

Thilo Spinner +7·Apr 21, 2026

BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications

Forget fragile monoliths and unauditable AI chaos: BONSAI offers a structured workspace where humans and AI agents collaboratively build visual analytics applications with speed and rigor.

Thilo Spinner, Matthias Miller +5

Code Generation & Program Synthesis Computer Vision Tool Use & Agents

College of Computer Science and Technology·Apr 21, 2026·also HKUST

iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation

Stop feeding your LLM-based bug reproduction tools irrelevant code: iCoRe's correlation-aware retrieval boosts test generation accuracy by up to 31.7%.

Junyi Wang, Jialun Cao, Zhongxin Liu

Code Generation & Program Synthesis Recommendation & Information Retrieval Tool Use & Agents

Apr 21, 2026·also UMN

Improving LLM-Driven Test Generation by Learning from Mocking Information

LLMs generate better unit tests when they learn from existing test mocks, achieving higher code coverage and mutant killing rates.

Jamie Lee, Flynn Teh, Hengcheng Zhu +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Apr 21, 2026

MUCOCO: Automated Consistency Testing of Code LLMs

Code LLMs fail consistency checks on 15% of inputs, revealing a significant reliability gap that existing benchmarks miss.

Chua Jin Chou, Khant That Lwin +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks