The authors introduce the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework for evaluating LLM-based cybersecurity agents across offensive and defensive domains. CAIBench integrates five evaluation categories (Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments) to address the limitation that existing benchmarks assess isolated skills. Experiments with state-of-the-art AI models reveal a gap between security knowledge and adaptive capability, particularly in multi-step adversarial scenarios and on robotic targets, which motivates the meta-benchmark approach.
LLMs may ace cybersecurity trivia, but when it comes to real-world attack-and-defend scenarios, their performance plummets, especially against robotic targets.
Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework for evaluating LLMs and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics (~70% success) but substantial degradation in multi-step adversarial (A&D) scenarios (20-40% success) and on robotic targets (22% success). The combination of framework scaffolding and LLM choice significantly impacts performance; we find that results in Attack and Defense CTFs vary by up to 2.6× depending on how well the two are matched. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, emphasizing the need for a meta-benchmark.