This paper systematically compares four LLM-based architectures for automated patching (fixed workflows, single-agent systems, multi-agent systems, and general-purpose code agents), evaluating them on patch correctness, failure modes, token usage, and execution time. General-purpose code agents achieve the strongest overall patching performance, which the authors attribute to general-purpose tool interfaces that adapt across vulnerability types, while multi-agent systems incur substantially higher overhead and a greater risk of reasoning drift on complex tasks. The authors conclude that architectural design and iteration depth, rather than model capability alone, dominate the reliability and cost of automated patching.
Model capability alone is not the deciding factor: the *architecture* of your LLM-based patching system, together with its iteration depth, drives reliability and cost, with general-purpose code agents surprisingly outperforming specialized multi-agent systems.
Large language models (LLMs) have shown promise for automated patching, but their effectiveness depends strongly on how they are integrated into patching systems. While prior work explores prompting strategies and individual agent designs, the field lacks a systematic comparison of patching architectures. In this paper, we present a controlled evaluation of four LLM-based patching paradigms -- fixed workflow, single-agent system, multi-agent system, and general-purpose code agents -- using a unified benchmark and evaluation framework. We analyze patch correctness, failure modes, token usage, and execution time across real-world vulnerability tasks. Our results reveal clear architectural trade-offs: fixed workflows are efficient but brittle, single-agent systems balance flexibility and cost, and multi-agent designs improve generalization at the expense of substantially higher overhead and increased risk of reasoning drift on complex tasks. Surprisingly, general-purpose code agents achieve the strongest overall patching performance, benefiting from general-purpose tool interfaces that support effective adaptation across vulnerability types. Overall, we show that architectural design and iteration depth, rather than model capability alone, dominate the reliability and cost of LLM-based automated patching.