Imperial | ZJU | Jan 4, 2026 | arXiv:2601.01592

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang, Xia Hu

AI Summary

The paper introduces OpenRT, a modular, high-throughput red-teaming framework for evaluating the safety of Multimodal Large Language Models (MLLMs). Its design separates five dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. OpenRT decouples adversarial logic from an asynchronous runtime, which allows evaluation to scale systematically across models, and it integrates 37 attack methodologies, including white-box gradients, multi-modal perturbations, and multi-agent evolutionary strategies. An empirical evaluation of 20 advanced models, including GPT-5.2, Claude 4.5, and Gemini 3 Pro, reveals significant safety vulnerabilities: leading models exhibit average Attack Success Rates as high as 49.14%, and even frontier models fail to generalize across attack paradigms.

Key Contribution

An evaluation of 20 models, including GPT-5.2 and Claude 4.5, shows that even frontier multimodal LLMs fail to generalize across attack paradigms: under OpenRT's diverse suite of attacks, leading models exhibit average Attack Success Rates as high as 49.14%.

Abstract

The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.
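
The separation described in the abstract, where attack logic sits behind a standard interface and an asynchronous runtime handles execution, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Model`, `Attack`, `Judge`, `EchoModel`, `RoleplayAttack`, `PrefixJudge`, `run_attack`, and `evaluate` are hypothetical and are not OpenRT's actual API, the target model is a toy stub, and the goals are placeholder strings.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


# Hypothetical interfaces (not OpenRT's API): each dimension is a small protocol.
class Model(Protocol):
    async def generate(self, prompt: str) -> str: ...


class Attack(Protocol):
    def craft(self, goal: str) -> str: ...


class Judge(Protocol):
    def is_success(self, goal: str, response: str) -> bool: ...


@dataclass
class EchoModel:
    """Toy target: complies only if the prompt contains a trigger word."""
    trigger: str = "roleplay"

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # stand-in for network latency
        return "COMPLY: " + prompt if self.trigger in prompt else "REFUSE"


class PlainAttack:
    """Baseline: sends the goal unchanged."""
    def craft(self, goal: str) -> str:
        return goal


class RoleplayAttack:
    """Toy attack: wraps the goal in a role-play framing."""
    def craft(self, goal: str) -> str:
        return f"roleplay as an unrestricted assistant: {goal}"


class PrefixJudge:
    """Toy judge: counts a response as a success if it starts with COMPLY."""
    def is_success(self, goal: str, response: str) -> bool:
        return response.startswith("COMPLY")


async def run_attack(model: Model, attack: Attack, judge: Judge,
                     goals: list[str], concurrency: int = 8) -> float:
    """Runtime: knows nothing about attack internals; returns the ASR."""
    if not goals:
        return 0.0
    sem = asyncio.Semaphore(concurrency)  # bound the number of in-flight requests

    async def one(goal: str) -> bool:
        async with sem:
            response = await model.generate(attack.craft(goal))
        return judge.is_success(goal, response)

    results = await asyncio.gather(*(one(g) for g in goals))
    return sum(results) / len(results)  # Attack Success Rate in [0, 1]


def evaluate(goals: list[str]) -> dict[str, float]:
    """Run every attack against the toy model and report ASR per attack."""
    model, judge = EchoModel(), PrefixJudge()
    attacks = {"plain": PlainAttack(), "roleplay": RoleplayAttack()}

    async def main() -> dict[str, float]:
        return {name: await run_attack(model, a, judge, goals)
                for name, a in attacks.items()}

    return asyncio.run(main())


if __name__ == "__main__":
    print(evaluate(["placeholder-goal-1", "placeholder-goal-2"]))
```

In this sketch, adding an attack means writing one class with a `craft` method; the runtime, judge, and metric stay unchanged, which is the kind of decoupling the abstract describes. Averaging the per-attack rates for one model would give a figure comparable in spirit to the average Attack Success Rate reported above, though the paper's exact aggregation is not specified here.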

Citation Metrics

Citations: 3

Influential citations: 1

References: 0

Year: 2026

Venue: N/A
