UIUC · Mar 21, 2025 · arXiv:2503.17332

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, Daniel Kang

AI Summary

The paper introduces CVE-Bench, a novel benchmark designed to evaluate the ability of LLM agents to exploit real-world web application vulnerabilities identified as critical-severity Common Vulnerabilities and Exposures (CVEs). CVE-Bench addresses the limitations of existing benchmarks by providing a sandboxed environment that simulates real-world conditions for LLM agents to interact with vulnerable web applications. Experiments using CVE-Bench reveal that state-of-the-art LLM agent frameworks can successfully exploit up to 13% of the included vulnerabilities.

Key Contribution

CVE-Bench, a new benchmark, shows that state-of-the-art LLM agents can autonomously exploit up to 13% of the real-world, critical-severity web application vulnerabilities it includes.

Abstract

Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness · Tool Use & Agents

Citation Metrics

Citations: 40

Influential citations: 4

References: 86

Year: 2025

Venue: International Conference on Machine Learning
