Mar 30, 2026arXiv:2603.28998

Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

Yicheng Cai, Mitchell John DeStefano, Guodong Dong, Pulkit Handa, Peng Liu, Tejas Singhal, Pei-Tzu Tseng, Winston Jen White

AI Summary

This paper identifies the need for benchmarks evaluating blue team capabilities of multi-agent AI systems in security operation centers (SOCs), as current benchmarks focus primarily on red team activities. It proposes a set of design principles for constructing SOC-bench, a benchmark to evaluate AI blue team capabilities. The conceptual design of SOC-bench includes five blue team tasks within the context of a large-scale ransomware attack incident response scenario.

Key Contribution

The lack of comprehensive benchmarks for AI blue teams leaves SOCs vulnerable, and this paper lays the groundwork for rectifying that gap.

Abstract

As Large Language Models (LLMs) and multi-agent AI systems are demonstrating increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems&agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To our best knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature. Existing blue team benchmarks focus on a particular task. The goal of this work is to develop a set of design principles for the construction of a benchmark, which is denoted as SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

Related Papers