CAS | Nanyang Normal University | Mar 17, 2026 | arXiv:2603.17104

When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

AI Summary

This paper introduces SLUMP, a benchmark that evaluates how faithfully coding agents follow emergent specifications in long-horizon coding tasks, where the full task is disclosed progressively through interaction. The authors measure faithfulneSs Loss Under eMergent sPecification (SLUMP) by comparing implementation fidelity under emergent versus single-shot specifications, using 20 ML papers and a five-level component-faithfulness rubric. Experiments with Claude Code and Codex show lower implementation fidelity under emergent specifications: structural integration degrades on both platforms, and semantic faithfulness drops substantially on Claude Code. The proposed ProjectGuard partially mitigates this loss.

Key Contribution

Coding agents struggle to maintain faithfulness to specifications that emerge gradually over long interactions, losing significant implementation fidelity compared to single-shot specifications.

Abstract

Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulneSs Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while semantic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an external project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faithful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct evaluation target for long-horizon coding agents.
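
The abstract defines SLUMP as a difference in final faithfulness between two conditions and reports mitigation as a share of that gap recovered. The sketch below illustrates that arithmetic only. It assumes the five rubric levels are coded 0 to 4 and that a repository's faithfulness is the mean level rescaled to [0, 1]; the abstract does not state the paper's aggregation, so the function names, the coding, and the sample scores are all hypothetical and are not results from the paper.

```python
from statistics import mean

# Assumption: the five rubric levels are coded 0..4 (not stated in the abstract).
MAX_LEVEL = 4


def faithfulness(levels):
    """Mean component rubric level, rescaled to [0, 1] (assumed aggregation)."""
    if not levels:
        raise ValueError("need at least one scored component")
    return mean(levels) / MAX_LEVEL


def slump(single_shot, emergent):
    """Faithfulness loss: single-shot control minus emergent condition."""
    return faithfulness(single_shot) - faithfulness(emergent)


def gap_recovered(single_shot, emergent, mitigated):
    """Share of the single-shot vs. emergent gap closed by a mitigation."""
    gap = slump(single_shot, emergent)
    if gap == 0:
        raise ValueError("no gap to recover")
    return (faithfulness(mitigated) - faithfulness(emergent)) / gap


# Made-up rubric levels for four components, for illustration only.
single_shot = [4, 4, 3, 4]
emergent = [4, 2, 1, 3]
mitigated = [4, 3, 3, 4]

loss = slump(single_shot, emergent)
recovered = gap_recovered(single_shot, emergent, mitigated)
print(f"faithfulness loss: {loss:.4f}")
print(f"gap recovered:     {recovered:.0%}")
```

Under this reading, a positive loss means the single-shot control was implemented more faithfully, and a recovery of 1.0 would mean the mitigation fully closes the gap.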

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
