The paper introduces AnyPoC, a multi-agent framework that automatically generates proof-of-concept (PoC) tests for candidate bug reports produced by LLM-based agents, enabling scalable validation of suspected software defects. AnyPoC works in three stages: it analyzes and fact-checks the report, iteratively synthesizes and executes a PoC while collecting execution traces, and then independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. Applied to 12 large software systems, AnyPoC produces 1.3x more valid PoCs for true-positive reports and rejects 9.8x more false-positive reports than coding agents such as Claude Code and Codex. It has led to the discovery of 122 new bugs and the adoption of 45 generated PoCs as official regression tests.
LLM agents can now validate candidate bug reports automatically: AnyPoC, a multi-agent framework that synthesizes, executes, and scrutinizes proof-of-concept tests, produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive reports than state-of-the-art coding agents.
While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesize an executable proof-of-concept test (PoC), such as a script, command sequence, or crafted input, that triggers the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. It operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, to 12 critical software systems across diverse languages and domains (many with millions of lines of code), including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to state-of-the-art coding agents such as Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.
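The three stages described above can be illustrated with a minimal sketch. This is not AnyPoC's implementation: the names `validate_report`, `run_poc`, and `Trace` are hypothetical, the LLM agents are replaced by plain callables supplied by the caller, the PoC is assumed to be a Python script, and the knowledge base is omitted. The point it tries to show is that acceptance depends only on traces from real executions, including a fresh re-run, and never on what the synthesizer claims.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trace:
    """Evidence captured from one real execution of a PoC."""
    returncode: int
    output: str


def run_poc(script: str, timeout: float = 30.0) -> Trace:
    """Run a candidate PoC (assumed here to be a Python script) in a fresh process."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return Trace(returncode=-1, output="timeout")
    return Trace(proc.returncode, proc.stdout + proc.stderr)


def validate_report(
    report: str,
    fact_check: Callable[[str], bool],                    # stand-in for the analysis agent
    synthesize: Callable[[str, Optional[Trace]], str],    # stand-in for the PoC-writing agent
    triggers_bug: Callable[[Trace], bool],                # stand-in for the scrutiny step
    max_iters: int = 3,
) -> Optional[str]:
    """Return a PoC that reproduces the reported defect, or None to reject the report."""
    # Stage 1: analyze and fact-check the report before spending effort on it.
    if not fact_check(report):
        return None

    feedback: Optional[Trace] = None
    for _ in range(max_iters):
        # Stage 2: synthesize a PoC, execute it, and collect the actual trace.
        poc = synthesize(report, feedback)
        trace = run_poc(poc)
        if triggers_bug(trace):
            # Stage 3: re-execute independently and judge only the new trace.
            if triggers_bug(run_poc(poc)):
                return poc
        feedback = trace  # feed the failed trace into the next attempt
    return None
```

With an oracle such as `lambda t: t.returncode != 0 and "ZeroDivisionError" in t.output`, a script that merely prints the error name exits with status 0 and is rejected, while one that actually raises the exception is accepted after the re-run.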