ProgramBench is a benchmark that evaluates whether software engineering agents can rebuild a program from scratch given only the program and its documentation. It contains 200 tasks ranging from compact CLI tools to complex software such as FFmpeg and SQLite, with end-to-end behavioral tests generated by agent-driven fuzzing. An evaluation of 9 LLMs finds that none fully resolves any task: the best model passes 95% of tests on only 3% of tasks, and models tend to generate monolithic implementations.
LLMs can't rebuild software from scratch, even for widely used programs like FFmpeg and SQLite, revealing a critical gap in their ability to make high-level software architecture decisions.
Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
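To make the evaluation idea concrete, the following is a minimal sketch of end-to-end behavioral comparison against a reference executable. It is not the benchmark's harness, which the abstract does not describe. The names `observe`, `pass_rate`, and `fraction_of_tasks_at_or_above` are hypothetical, comparing only exit code and standard output is an assumption, and reading "95% of tests" as "at least 95%" is also an assumption.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    """Externally visible behavior of one run (an assumed notion of 'behavior')."""
    exit_code: int
    stdout: bytes


def observe(cmd: Sequence[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command with the given stdin and record its exit code and stdout."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.returncode, proc.stdout)


def pass_rate(reference: Sequence[str], candidate: Sequence[str], cases) -> float:
    """Fraction of (args, stdin) cases where the candidate matches the reference.

    No knowledge of the candidate's source layout is needed: only the two
    executables are run, so implementation structure is left unconstrained.
    """
    passed = sum(
        observe([*reference, *args], stdin) == observe([*candidate, *args], stdin)
        for args, stdin in cases
    )
    return passed / len(cases)


def fraction_of_tasks_at_or_above(rates: Sequence[float], threshold: float) -> float:
    """Share of tasks whose per-task pass rate reaches the threshold."""
    return sum(r >= threshold for r in rates) / len(rates)


if __name__ == "__main__":
    # Toy stand-ins for a reference program and a rebuilt candidate (hypothetical).
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys; s = sys.stdin.read(); "
                 "print(s.upper() if len(s) < 4 else s, end='')"]
    cases = [([], b"ab"), ([], b"abcdef")]
    print(pass_rate(reference, candidate, cases))
```

In this sketch the test inputs are written by hand. In the setting the abstract describes, they would instead come from agent-driven fuzzing of the reference executable.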