UIUC | May 26, 2026 | arXiv:2605.26548

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

Hwiwon Lee, Dongjun Kim, Ziqi Zhang, Chun Xia, Chunqiu Steven Xia, Lingming Zhang

AI Summary

SEC-bench Pro is introduced as a new benchmark for evaluating LLMs on real-world bug hunting scenarios in complex software systems like V8 and SpiderMonkey, addressing limitations of existing benchmarks that rely on fuzzing harnesses or target-specific descriptions. The benchmark includes 183 validated vulnerabilities with PoC inputs and links to fixes, covering memory-safety, sandbox, and JIT bugs. Evaluations using SEC-bench Pro reveal that even frontier models struggle, achieving below 40% success, although a two-agent union of ClaudeCode and Codex reaches nearly 50% on SpiderMonkey, highlighting the need for improvements in long-horizon bug hunting capabilities.

Key Contribution

Coding agents with frontier models solve fewer than 40% of SEC-bench Pro instances on either V8 or SpiderMonkey, so they still miss over 60% of real-world bugs in this critical software. This reveals a significant gap in their ability to automate complex, long-horizon security tasks.

Abstract

Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific descriptions, or vulnerability-reproduction tasks. We present SEC-bench Pro, a benchmark for measuring agent bug hunting on critical, high-complexity software systems. This work turns disclosed reports with concrete PoC inputs and linked fixes into reproducible tasks through a three-phase pipeline for vulnerability collection, environment reconstruction, and oracle-based validation. We instantiate SEC-bench Pro with 183 validated vulnerabilities across V8 and SpiderMonkey, including a V8 subset with more than $1.5 million in cumulative Google Vulnerability Reward Program awards. These instances span memory-safety, sandbox, JIT, and race-condition bugs under browser-grade and runtime-grade execution conditions. Our evaluation shows that coding agents with frontier models remain below 40% success on both evaluated engines. The open-weight Kimi-K2.6 baseline reaches 11.7% on V8, while the strongest frontier configuration reaches 32.0% on V8 and 38.8% on SpiderMonkey. ClaudeCode and Codex solve complementary instance sets, and their two-agent union reaches 37.9% on V8 and 48.8% on SpiderMonkey. SEC-bench Pro provides robust environments for assessing LLM-based security agents and exposes limitations in long-horizon bug hunting tasks.
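The two-agent union figure in the abstract can be read as the share of instances solved by at least one of the two agents. The sketch below illustrates that reading under stated assumptions: the function names, the instance IDs, and the total of 10 instances are all hypothetical and are not taken from the benchmark or its harness.

```python
def success_rate(solved: set[str], total: int) -> float:
    """Return the percentage of benchmark instances in `solved`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * len(solved) / total


def union_success_rate(agent_a: set[str], agent_b: set[str], total: int) -> float:
    """Percentage of instances solved by at least one of two agents."""
    return success_rate(agent_a | agent_b, total)


# Hypothetical instance IDs for illustration only (not real benchmark data).
agent_a = {"bug-01", "bug-02", "bug-03"}
agent_b = {"bug-03", "bug-04"}
total_instances = 10

rate_a = success_rate(agent_a, total_instances)
rate_b = success_rate(agent_b, total_instances)
rate_union = union_success_rate(agent_a, agent_b, total_instances)

print(f"agent A: {rate_a:.1f}%, agent B: {rate_b:.1f}%, union: {rate_union:.1f}%")
```

Because the solved sets overlap, the union rate is at most the sum of the individual rates. The more complementary the two sets are, the closer the union gets to that sum.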

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0
Influential citations: 0
References: 42
Year: 2026
Venue: N/A
