Alias Robotics | Oct 28, 2025 | arXiv:2510.24317

Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents

María Sanz-Gómez, V. Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, C. R. J. V. Chavez, Maite del Mundo de Torres

AI Summary

The authors introduce Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework, to evaluate LLM-based cybersecurity agents across offensive and defensive domains. CAIBench integrates five evaluation categories, including CTFs, cyber range exercises, knowledge benchmarks, and privacy assessments, to address the limitations of existing benchmarks that assess isolated skills. Experiments with state-of-the-art AI models reveal a performance gap between security knowledge and adaptive capabilities, particularly in multi-step adversarial scenarios and robotic targets, highlighting the importance of a meta-benchmark approach.

Key Contribution

LLMs may ace cybersecurity trivia, but when it comes to real-world attack-and-defend scenarios, their performance plummets, especially against robotic targets.

Abstract

Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework that allows evaluating LLM models and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor-relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics (~70% success) but substantial degradation in multi-step adversarial (A&D) scenarios (20-40% success), or worse in robotic targets (22% success). The combination of framework scaffolding and LLM model choice significantly impacts performance; we find that proper matches improve up to 2.6× variance in Attack and Defense CTFs. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, emphasizing the need for a meta-benchmark.

Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 4

Influential citations: 1

References: 20

Year: 2025

Venue: arXiv.org
