May 5, 2026 | arXiv:2605.03546

ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang, K. Lieret, J. Ma, Parth Thakkar, D. Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, R. Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

AI Summary

ProgramBench is a benchmark that evaluates whether software engineering agents can rebuild programs from scratch, given only the program and its documentation. It contains 200 tasks ranging from compact CLI tools to complex software such as FFmpeg and SQLite, and uses agent-driven fuzzing to generate end-to-end behavioral tests. None of the 9 evaluated LLMs fully resolves any task; the best model passes 95% of tests on only 3% of tasks. Models also tend to generate monolithic implementations.

Key Contribution

LLMs can't rebuild software from scratch, even for widely used programs like FFmpeg and SQLite, revealing a critical gap in their ability to make high-level software architecture decisions.

Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
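The evaluation idea in the abstract (compare a rebuilt program's observable behavior with the reference executable, then report how many tasks clear a pass-rate threshold) can be illustrated with a minimal sketch. This is not the ProgramBench harness: the names `Observation`, `observe`, `pass_rate`, and `fraction_at_or_above` are hypothetical, the test cases are hand-written stand-ins for the fuzzing-generated tests, and comparing only exit code and stdout is an assumption about what counts as "behavior".

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one invocation (assumed: exit code + stdout)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record its exit code and stdout.

    A timeout raises subprocess.TimeoutExpired; a real harness would
    probably record that as a failed test instead of propagating it.
    """
    result = subprocess.run(command, input=stdin, capture_output=True, timeout=timeout)
    return Observation(result.returncode, result.stdout)


def pass_rate(
    reference: list[str],
    candidate: list[str],
    cases: list[tuple[list[str], bytes]],
) -> float:
    """Fraction of (args, stdin) cases where the candidate matches the reference."""
    if not cases:
        return 0.0
    passed = sum(
        observe(reference + args, stdin) == observe(candidate + args, stdin)
        for args, stdin in cases
    )
    return passed / len(cases)


def fraction_at_or_above(rates: list[float], threshold: float) -> float:
    """Share of tasks whose pass rate meets the threshold (1.0 means fully resolved)."""
    if not rates:
        return 0.0
    return sum(rate >= threshold for rate in rates) / len(rates)


if __name__ == "__main__":
    # Toy "reference" and "candidate" programs, both run through the Python interpreter.
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys; d = sys.stdin.read(); "
                 "print(d.upper() if len(d) < 4 else d, end='')"]
    cases = [([], b"ab"), ([], b"abcdef")]
    print(pass_rate(reference, candidate, cases))
```

Because the comparison treats both programs as black boxes, it places no constraint on how the candidate is structured internally, which is the property the abstract attributes to end-to-end behavioral testing. Under this sketch, a headline figure such as "95% of tests on 3% of tasks" would correspond to `fraction_at_or_above(rates, 0.95)` evaluating to 0.03 over the per-task pass rates.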

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
