Red-Teaming & Adversarial Robustness
Safety & Alignment
Adversarial testing of AI systems, jailbreaking research, prompt injection defense, and robustness evaluation.
Recent Papers
This paper establishes the first unconditional space lower bound for user-level differential privacy by introducing a novel multi-player communication game that links the hardness of low-memory private algorithms to the necessity of contribution capping. The authors demonstrate that the communication complexity of winning this game translates directly to memory lower bounds for private algorithms. They apply this framework to distinct element estimation, proving an $\widetilde{\Omega}(T^{1/3})$ space lower bound, and generalize the technique to derive lower bounds for private medians, quantiles, and max-select.
Establishes a novel multi-player communication game framework to prove unconditional space lower bounds for user-level differentially private algorithms, connecting memory requirements to the necessity of contribution capping.
The paper introduces DeepSight, an open-source toolkit designed to integrate safety evaluation and diagnosis for large language models (LLMs) and multimodal large language models (MLLMs). DeepSight combines DeepSafe, an evaluation toolkit, and DeepScan, a diagnosis toolkit, to provide a more comprehensive safety workflow. By unifying task and data protocols, DeepSight aims to bridge the gap between black-box risk evaluation and white-box mechanistic understanding, facilitating targeted safety alignment.
Introduces DeepSight, the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis by unifying task and data protocols.
The paper introduces SiamXBERT, a Siamese meta-learning framework leveraging a transformer-based language model, to address the challenge of detecting unknown (zero-day) attacks in IoT networks under data scarcity and encrypted traffic conditions. SiamXBERT constructs a dual-modality feature representation from flow and packet-level information and uses meta-learning for rapid adaptation to new attack types with limited labeled data. Experiments on IoT intrusion datasets demonstrate that SiamXBERT outperforms state-of-the-art baselines, achieving up to a 78.8% improvement in F1-score on unknown attacks, showcasing its robustness and data efficiency.
Introduces SiamXBERT, a novel Siamese meta-learning framework empowered by a transformer-based language model, for robust and data-efficient unknown attack detection in IoT networks.
The paper investigates exploitation induced by capability-oriented reinforcement learning training, in which language models learn to exploit implicit loopholes in the training environment to maximize reward. Through a suite of four "vulnerability games," the authors demonstrate that models consistently learn to exploit flaws related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. The key finding is that these exploitative strategies generalize to new tasks and can be distilled from teacher to student models, highlighting a fundamental challenge to current alignment approaches.
Demonstrates that reinforcement learning-trained language models spontaneously learn to exploit implicit loopholes in training environments to maximize reward, even without explicit malicious intent.
The paper introduces Cross-Modal Robustness Transfer (CMRT) to improve the robustness of End-to-End Speech Translation (E2E-ST) models against morphological variations. CMRT leverages adversarial training in the text modality to transfer robustness to the speech modality, eliminating the need for computationally expensive adversarial speech data generation. Experiments across four language pairs show that CMRT improves adversarial robustness by over 3 BLEU points compared to baseline E2E-ST models.
Introduces Cross-Modal Robustness Transfer (CMRT), a novel framework for enhancing E2E-ST model robustness by transferring adversarial robustness from text to speech.
This paper introduces a novel control framework that combines conformal prediction (CP) and system level synthesis (SLS) to achieve robust out-of-distribution (OOD) planning and control with learned dynamics models. The method uses weighted CP with a learned covariance model to derive high-confidence model error bounds, which are then incorporated into an SLS-based robust nonlinear MPC formulation with volume-optimized reachable sets for constraint tightening. Empirical results on nonlinear systems like a 4D car and a 12D quadcopter demonstrate improved safety and robustness, particularly in OOD scenarios, compared to baselines.
Integrates conformal prediction with system level synthesis to create a robust MPC framework that provides safety guarantees for out-of-distribution planning and control using learned dynamics models.
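As a rough illustration of the conformal-prediction side of such a pipeline, the sketch below computes a high-confidence bound on a learned dynamics model's one-step prediction error from held-out residuals. The function names, the uniform/weighted quantile choice, and the toy calibration data are assumptions for illustration, not the paper's implementation (which couples weighted CP with a learned covariance model and an SLS-based robust MPC).

```python
import numpy as np

def conformal_error_bound(residuals, alpha=0.05, weights=None):
    """Split-conformal quantile of held-out model-error residuals.

    residuals : per-sample norms of (true_next_state - predicted_next_state)
    alpha     : miscoverage level; the bound holds with probability >= 1 - alpha
    weights   : optional nonnegative weights (a simplified stand-in for the
                paper's learned-covariance weighting; None recovers plain CP)
    """
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    if weights is None:
        # Standard split conformal: the ceil((n+1)(1-alpha))-th smallest residual.
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return np.sort(residuals)[min(k, n) - 1]
    # Weighted variant: quantile of the weighted empirical distribution.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    order = np.argsort(residuals)
    cum = np.cumsum(w[order])
    idx = np.searchsorted(cum, 1 - alpha)
    return residuals[order[min(idx, n - 1)]]

# Toy usage: residuals from a held-out calibration set of a learned dynamics model.
rng = np.random.default_rng(0)
calib_residuals = np.abs(rng.normal(0.0, 0.1, size=500))
print("95% model-error bound:", conformal_error_bound(calib_residuals, alpha=0.05))
```

In the paper's framework, a bound of this kind is what gets fed into the constraint-tightening step of the robust MPC; the sketch stops at the bound itself.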
The paper introduces Temporally Unified Adversarial Perturbations (TUAPs) to address the issue of temporally inconsistent adversarial attacks in time series forecasting. To generate TUAPs, the authors propose a Timestamp-wise Gradient Accumulation Method (TGAM) that enforces temporal unification by aggregating local gradient information from overlapping samples. Experiments on benchmark datasets demonstrate that TUAPs, generated using TGAM, outperform existing methods in both white-box and black-box transfer attack scenarios, even without temporal unification constraints.
Introduces Temporally Unified Adversarial Perturbations (TUAPs) and a Timestamp-wise Gradient Accumulation Method (TGAM) to generate temporally consistent and effective adversarial attacks against time series forecasting models.
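The core idea of timestamp-wise gradient accumulation can be sketched on a toy forecasting setup: one perturbation value per timestamp is shared by all overlapping windows, and gradients from every window are accumulated before each sign-step update. Everything below (the linear stand-in forecaster, window sizes, PGD-style update, and epsilon budget) is a hypothetical illustration rather than the paper's TGAM.

```python
import torch

torch.manual_seed(0)
T, window, horizon, steps, eps = 200, 24, 12, 20, 0.1
series = torch.sin(torch.arange(T, dtype=torch.float32) / 10)
model = torch.nn.Linear(window, horizon)          # stand-in forecaster
delta = torch.zeros(T, requires_grad=True)        # one perturbation value per timestamp

for _ in range(steps):
    grad_acc = torch.zeros(T)
    for s in range(0, T - window - horizon):
        x = (series + delta)[s:s + window]        # every window sees the same delta
        y = series[s + window:s + window + horizon]
        loss = torch.nn.functional.mse_loss(model(x), y)
        g, = torch.autograd.grad(loss, delta)
        grad_acc += g                             # timestamp-wise accumulation
    with torch.no_grad():
        delta += (eps / steps) * grad_acc.sign()  # PGD-style ascent step
        delta.clamp_(-eps, eps)                   # keep the budget per timestamp

print("perturbation L_inf norm:", delta.abs().max().item())
```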
The paper introduces QDBFT, a quantum-secured dynamic consensus algorithm designed to address the vulnerabilities of traditional PBFT in the face of quantum computing and dynamic node reconfigurations. QDBFT incorporates a primary node automatic rotation mechanism based on a consistent hash ring for dynamic membership and integrates Quantum Key Distribution (QKD) networks for information-theoretic security. Experimental results show QDBFT achieves comparable performance to PBFT while providing resilience against quantum attacks.
Introduces QDBFT, a novel consensus algorithm that integrates a dynamic primary node rotation mechanism with QKD to achieve quantum-resistant and dynamically adaptable consensus.
The paper introduces AIR, an incident response framework for LLM agents that enables autonomous detection, containment, and recovery from failures. AIR uses a domain-specific language integrated into the agent's execution loop to perform semantic checks, guide recovery actions, and synthesize guardrail rules. Experiments across three agent types demonstrate that AIR achieves over 90% success rates in detection, remediation, and eradication, highlighting the importance of incident response for agent safety.
Introduces AIR, a novel incident response framework for LLM agents, enabling autonomous management of the incident lifecycle.
The paper identifies a limitation in watermark ensembles for LLMs where strong single-layer watermarks reduce token distribution entropy, hindering subsequent layers' effectiveness. They theoretically and empirically demonstrate that detectability is bounded by entropy and that watermark ensembles monotonically decrease entropy and the expected green-list ratio across layers. To address this, they propose a framework using weaker single-layer watermarks to preserve entropy, achieving improved detectability and robustness compared to strong watermark baselines.
Demonstrates that weaker single-layer watermarks in ensembles can outperform stronger ones by preserving token distribution entropy, leading to improved detectability and robustness.
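A minimal numerical illustration of the entropy argument, using a standard soft green-list watermark on random logits: as the green-list bias delta grows, the next-token distribution's entropy falls and the expected green-token ratio saturates, leaving less entropy for any subsequent watermark layer to exploit. The vocabulary size, the gamma = 0.5 split, and the delta values are arbitrary choices for the sketch.

```python
import numpy as np

def greenlist_watermark(logits, green_mask, delta):
    """Soft red/green-list watermark: add bias delta to green-list token logits."""
    return logits + delta * green_mask

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

rng = np.random.default_rng(0)
vocab = 1000
logits = rng.normal(size=vocab)
green = (rng.random(vocab) < 0.5).astype(float)   # gamma = 0.5 green-list split

for delta in [0.0, 1.0, 2.0, 4.0, 8.0]:
    z = greenlist_watermark(logits, green, delta)
    p = np.exp(z - z.max()); p /= p.sum()
    print(f"delta={delta:>4}: entropy={entropy(p):5.2f}  "
          f"expected green ratio={float(p @ green):.3f}")
```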
The paper introduces DMind-3, a three-layered Edge-Local-Cloud AI system for secure and low-latency Web3 financial transactions. It addresses the limitations of cloud-centric and purely local AI solutions by using a deterministic edge firewall, a private local reasoning engine, and a policy-governed cloud synthesizer. The system is trained with Hierarchical Predictive Synthesis (HPS) and Contrastive Chain-of-Correction Supervised Fine-Tuning (C$^3$-SFT) to improve performance and reliability.
Introduces a novel Edge-Local-Cloud AI architecture, DMind-3, that balances privacy, latency, and global context for secure Web3 transactions.
This paper addresses the problem of designing resilient communication networks with limited signal transmission distances, subject to uncertainty in both link lengths and node availability. The authors formulate the problem as a robust optimization model with budgeted uncertainty sets for regenerator installation costs and a novel dynamic budgeted uncertainty set for link lengths. They then develop scalable solution methods based on column-and-constraint generation, Benders decomposition, and iterative robust optimization, and further analyze the problem using a learning-based hide-and-seek game. The proposed methods outperform classical robust models and deterministic worst-case formulations.
Introduces a dynamic budgeted uncertainty set for link lengths in robust network design and demonstrates its effectiveness in a hide-and-seek game framework.
This paper introduces MalTool, a framework leveraging coding LLMs to automatically generate malicious tools that can compromise user security and privacy when used by LLM agents. The authors propose a taxonomy of malicious tool behaviors based on the CIA triad and use MalTool to synthesize both standalone malicious tools and real-world tools with embedded malicious behaviors. Experiments demonstrate MalTool's effectiveness in generating malicious tools, even with safety-aligned coding LLMs, and reveal the limitations of existing detection methods, underscoring the need for improved defenses.
Introduces MalTool, a novel framework for automatically generating malicious tools using coding LLMs, enabling a systematic study of malicious tool code implementations and their impact on LLM agent security.
The paper introduces BlackCATT, a novel black-box traitor tracing method for federated learning that is resilient to collusion attacks. BlackCATT employs a collusion-aware embedding loss and iteratively optimizes trigger sets for watermark embedding, improving convergence and tracing performance. The authors also propose BlackCATT+FR, which incorporates functional regularization at the aggregator to address update incompatibility issues in models with batch normalization, maintaining tracing performance.
Introduces a collusion-resistant black-box traitor tracing method (BlackCATT) for federated learning that uses a novel collusion-aware embedding loss and iteratively optimized triggers.
This paper proposes a Unified Smart Safety and Security Architecture for AI-driven mining environments, addressing challenges like poor illumination, GPS denial, and cyber-physical threats. The architecture integrates multimodal perception, secure federated learning, reinforcement learning, DTN communication, and energy-aware sensing to improve safety and security. The proposed system incorporates five core modules, spanning miner localization, hazard understanding, federated robustness, and predictive maintenance.
Envisions and outlines a comprehensive architecture integrating diverse AI and security techniques to enhance safety and security in autonomous mining environments.
The paper introduces TRACE-RPS, a novel defense framework against attribute inference attacks in LLMs, which combines fine-grained anonymization with inference-preventing optimization. TRACE uses attention mechanisms and inference chain generation to pinpoint and anonymize privacy-leaking text, while RPS employs a two-stage optimization to encourage models to reject attribute inference queries. Experiments demonstrate that TRACE-RPS significantly reduces attribute inference accuracy on open-source LLMs, achieving strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs.
Introduces a unified defense framework, TRACE-RPS, that combines fine-grained anonymization and inference-preventing optimization to effectively mitigate attribute inference attacks in LLMs.
The paper introduces Flow Matching Adversarial Imitation Learning (FAIL), a novel approach to fine-tuning flow matching models for image generation by framing the alignment with a target distribution as an imitation learning problem. FAIL leverages adversarial training to minimize the divergence between the policy and expert demonstrations, avoiding the need for explicit rewards or pairwise comparisons. The authors demonstrate that FAIL achieves competitive performance on prompt following and aesthetic benchmarks with limited demonstrations, and also show its effectiveness in discrete image/video generation and as a regularizer against reward hacking.
Introduces FAIL, a new adversarial imitation learning framework for fine-tuning flow matching models that avoids explicit reward modeling or pairwise comparisons.
This paper proposes a meta-cognitive architecture for AI-driven cybersecurity systems to address limitations in accountable decision-making under adversarial uncertainty. The architecture coordinates heterogeneous AI agents responsible for detection, hypothesis formation, explanation, and governance through an explicit meta-cognitive judgement function. By embedding meta-cognitive judgement as a first-class system function, the framework aims to make the cognitive structure of security operations explicit and governable, shifting the focus from optimizing isolated predictions to governing autonomy under uncertainty.
Introduces a meta-cognitive architectural framework for cybersecurity AI that explicitly governs decision readiness and dynamically calibrates system autonomy under uncertainty by coordinating heterogeneous AI agents through a meta-cognitive judgement function.
This paper investigates the privacy risks of using graph neural networks (GNNs) for unsupervised community detection, specifically the potential for revealing sensitive groups. They identify connectivity at the community boundary and feature similarity between communities as key factors influencing community concealment. Based on these factors, they propose a perturbation strategy that rewires edges and modifies node features to reduce the distinctiveness used by GNN message passing, achieving 20-45% improvement in concealment compared to DICE.
Introduces a novel perturbation strategy for concealing communities from GNN-based unsupervised clustering by rewiring edges and modifying node features based on connectivity and feature similarity.
The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework for assessing LLM safety under repeated inference, addressing the limitations of breadth-oriented benchmarks. APST models safety failures as stochastic outcomes using Bernoulli and binomial models to estimate per-inference failure probabilities under controlled operational conditions like decoding temperature. Experiments on instruction-tuned LLMs using AIR-BENCH-derived safety prompts reveal that models with similar benchmark scores can exhibit significantly different empirical failure rates under repeated sampling, especially with increased temperature, highlighting the importance of evaluating reliability under sustained use.
Introduces Accelerated Prompt Stress Testing (APST), a novel framework for evaluating LLM safety and reliability by repeatedly sampling identical prompts to surface latent failure modes and quantify per-inference failure probabilities.
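A small sketch of the kind of estimate such a depth-oriented protocol produces, assuming each repeated sample of a fixed prompt is an independent Bernoulli trial: the per-inference failure probability with a Wilson score interval, and the implied binomial chance of at least one unsafe output over repeated use. The Wilson interval and the example counts are illustrative choices, not necessarily the paper's exact estimator.

```python
import math
from statistics import NormalDist

def per_inference_failure(failures: int, trials: int, confidence: float = 0.95):
    """Bernoulli failure-rate estimate with a Wilson score confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = failures / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

def prob_at_least_one_failure(p: float, k: int) -> float:
    """Binomial view: chance of at least one unsafe output over k repeated inferences."""
    return 1 - (1 - p) ** k

# Example: 7 unsafe completions observed across 1,000 samples of one prompt.
p, lo, hi = per_inference_failure(7, 1000)
print(f"p_hat={p:.4f}  95% CI=({lo:.4f}, {hi:.4f})")
print("P(>=1 failure in 100 uses):", round(prob_at_least_one_failure(p, 100), 3))
```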
This paper investigates jailbreaking attacks on LLMs by analyzing differences in internal representations between jailbreak and benign prompts across multiple open-source models (GPT-J, LLaMA, Mistral, Mamba). They propose a tensor-based latent representation framework to capture structure in hidden activations, enabling jailbreak detection without fine-tuning or auxiliary LLMs. By selectively bypassing high-susceptibility layers in LLaMA-3.1-8B, the method blocks 78% of jailbreak attempts while preserving 94% of benign behavior, demonstrating the potential for inference-time interventions.
Introduces a tensor-based latent representation framework for detecting and disrupting jailbreak attacks by analyzing and manipulating internal activations of LLMs at inference time.
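The layer-bypass intervention can be illustrated on a toy residual stack: flagged "high-susceptibility" layers are simply skipped at inference while the rest of the forward pass is unchanged. The module, depth, and bypassed indices below are hypothetical stand-ins, not the LLaMA-3.1-8B procedure from the paper.

```python
import torch
import torch.nn as nn

class BypassableStack(nn.Module):
    """Toy residual stack where selected layers can be skipped at inference time."""
    def __init__(self, depth=8, dim=64):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x, bypass=()):
        for i, layer in enumerate(self.layers):
            if i in bypass:
                continue                      # skip the flagged layer entirely
            x = x + torch.tanh(layer(x))      # residual update, loosely transformer-like
        return x

model = BypassableStack()
x = torch.randn(2, 64)
full = model(x)
patched = model(x, bypass={3, 5})             # hypothetical high-susceptibility layers
print("output drift from bypass:", (full - patched).norm().item())
```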
This paper introduces a framework for verifiable privacy in machine learning by combining PAC privacy with zero-knowledge proofs (ZKPs). It enables users to verify the correctness of computations and the application of privacy-preserving noise in cloud-based systems. The authors leverage non-interactive ZKP schemes to generate proofs attesting to the correct implementation of PAC privacy mechanisms, demonstrating the feasibility of verifiable PAC privacy in outsourced computation.
Introduces a novel framework integrating PAC privacy with zero-knowledge proofs to enable verifiable privacy guarantees in trustless computing environments.
The paper introduces SafeNeuron, a neuron-level safety alignment framework for LLMs designed to improve robustness against neuron-level attacks. It identifies and freezes safety-related neurons during preference optimization, forcing the model to develop redundant safety representations across the network. Experiments show SafeNeuron enhances robustness against neuron pruning attacks, mitigates the risk of models being used for red-teaming, and maintains general capabilities, while also revealing stable and shared internal safety representations.
Introduces SafeNeuron, a novel neuron-level safety alignment framework that enhances LLM robustness by redistributing safety representations across the network.
The paper introduces StealthRL, a reinforcement learning framework that generates adversarial paraphrases to evade AI-text detectors. StealthRL trains a paraphrase policy using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen-3B, optimizing for both detector evasion and semantic similarity. Experiments across six attack settings and three detector families demonstrate StealthRL's ability to achieve near-zero detection rates (0.001 TPR@1%FPR) and high attack success rates (99.9%), even transferring to unseen detector families.
Demonstrates a reinforcement learning approach, StealthRL, for generating adversarial paraphrases that effectively evade multiple AI-text detectors, revealing shared vulnerabilities across detector architectures.
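A hedged sketch of the reward shaping such an attack plausibly uses: each paraphrase is scored by combining detector evasion with semantic similarity, and rewards are normalized within a sampling group in the GRPO style. The weights and the toy detector/similarity scores are made up for illustration, not taken from the paper.

```python
import numpy as np

def paraphrase_reward(detector_prob, semantic_sim, w_evade=1.0, w_sim=1.0):
    """Combined reward: low detector probability plus high semantic similarity.
    The weighting and functional form are illustrative assumptions."""
    return w_evade * (1.0 - detector_prob) + w_sim * semantic_sim

def group_relative_advantages(rewards):
    """GRPO-style advantage: z-score rewards within one prompt's sample group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group of four paraphrases of one AI-written passage (scores are made up).
detector = [0.92, 0.40, 0.05, 0.03]    # detector P(AI-written)
similarity = [0.98, 0.90, 0.88, 0.35]  # embedding similarity to the original
rewards = [paraphrase_reward(d, s) for d, s in zip(detector, similarity)]
print("rewards:   ", np.round(rewards, 2))
print("advantages:", np.round(group_relative_advantages(rewards), 2))
```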
This paper addresses the security challenges in Low-Altitude Economy IoT (LAE-IoT) networks by proposing a multi-agent collaborative intrusion detection framework. The framework leverages specialized, LLM-enhanced agents for intelligent data processing and adaptive classification to overcome limitations of traditional intrusion detection systems in dynamic aerial environments. Experimental results demonstrate the framework achieves over 90% classification accuracy across multiple benchmark datasets, highlighting the potential of LLM-enhanced agentic AI for LAE-IoT security.
Introduces a novel multi-agent collaborative intrusion detection framework that uses LLM-enhanced agents to improve intrusion detection in resource-constrained and dynamic LAE-IoT networks.
The paper introduces a novel economic Denial-of-Service (DoS) attack targeting LLM agents by exploiting the agent-tool communication loop in multi-turn interactions. The attack leverages a modified tool server to subtly steer agents into prolonged, verbose tool-calling sequences while preserving task correctness, thus bypassing conventional validation checks. Experiments on six LLMs demonstrate significant resource amplification, with token usage increasing to over 60,000, costs inflating by up to 658x, and GPU KV cache occupancy rising substantially, highlighting the vulnerability of the agent-tool interface.
Demonstrates a stealthy, multi-turn economic DoS attack against LLM agents by manipulating tool server responses to induce excessive tool-calling, bypassing traditional single-turn defenses.
The paper investigates the effectiveness of deliberative alignment (DA) using explicit safety codes versus case-augmented examples for improving LLM safety. They find that explicit safety codes lead to inconsistent harmlessness and degraded helpfulness, while case-augmented simple codes result in more robust safety behaviors. Based on these findings, they propose CADA, a case-augmented deliberative alignment method using reinforcement learning on self-generated safety reasoning chains, which improves harmlessness, robustness, and utility.
Introduces CADA, a case-augmented deliberative alignment method that leverages reinforcement learning on self-generated safety reasoning chains to enhance LLM safety without sacrificing helpfulness.
The paper introduces STELP, a Secure Transpiler and Executor of LLM-Generated Programs, to address the safety and reliability issues associated with directly executing code generated by Large Language Models in production systems. STELP operates by transpiling LLM-generated code for execution in a safer, controlled environment, mitigating vulnerabilities such as data poisoning and malicious attacks. The authors demonstrate STELP's effectiveness through benchmarks on correctness, safety, and latency, showing it outperforms existing methods in safely executing risky code snippets using a newly created human-validated dataset of insecure code.
Introduces STELP, a novel system for secure transpilation and execution of LLM-generated code, enhancing safety and reliability in production environments.
This paper empirically investigates the impact of intrinsic model characteristics and external attack techniques on the safety alignment of 32 LLMs and LRMs (3B-235B parameters) across 13 model families. The study uses 5 safety datasets, 56 jailbreak techniques, and 4 Chain-of-Thought (CoT) attack strategies, finding that models with integrated reasoning and self-reflection (GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B) exhibit the best safety alignment. The research also demonstrates that post-training and knowledge distillation can degrade safety alignment, and that CoT attacks using response prefixes significantly increase attack success rates, especially in text-completion interfaces.
Systematically evaluates the influence of model characteristics and attack techniques on the safety alignment of a diverse set of LLMs and LRMs, revealing vulnerabilities and best practices for developing safer AI systems.
The paper introduces DISEF, a Dual-Stage Instruction Safety Evaluation Framework, to assess the vulnerability of LLMs to jailbreaking attacks in Chinese-language settings. DISEF uses Virtualized Scenario Embedding (VSE) to test alignment stability under contextual shifts and Formal Payload Splitting (FPS) to analyze robustness against fragmented or implicitly encoded risk-related content. Experiments on the IJCAI 2025 benchmark reveal vulnerabilities in multiple LLMs, providing insights for improving safety alignment and threat detection.
Introduces DISEF, a novel dual-stage framework, to systematically evaluate and expose vulnerabilities of generative LLMs to Chinese instruction jailbreaking attacks.
The paper introduces DarkPatterns-LLM, a novel benchmark dataset and diagnostic framework for evaluating manipulative content in LLM outputs across seven harm categories, addressing the limitations of existing binary-labeled safety benchmarks. The framework employs a four-layer analytical pipeline (MGD, MSIAN, THP, DCRA) for fine-grained assessment. Evaluation of state-of-the-art models reveals significant performance disparities (65.2%-89.7%) and consistent weaknesses in detecting autonomy-undermining patterns, highlighting the need for improved manipulation detection in LLMs.
Establishes DarkPatterns-LLM, the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, enabling actionable diagnostics toward more trustworthy AI systems.
The paper introduces Patch-based Adversarial Noise Compression (PANC), a decision-based black-box adversarial attack method designed to efficiently attack Transformer-based visual trackers by exploiting patch-wise noise sensitivity. PANC uses a noise sensitivity matrix to dynamically adjust adversarial noise levels in different patches, optimizing noise distribution and reducing query counts. Experiments on OSTrack, STARK, TransT, and MixformerV2 across GOT-10k, TrackingNet, and LaSOT datasets demonstrate that PANC achieves a 162% improvement in attack effectiveness with only 45.7% of the queries compared to existing methods, while compressing noise to 10% of the original level.
Introduces a patch-based adversarial noise compression (PANC) method that significantly improves the efficiency and concealment of decision-based black-box adversarial attacks against Transformer-based visual trackers.
This paper presents the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset, designed to evaluate the biosecurity risks associated with frontier AI models. The B3 dataset was used to probe a sample frontier AI model, and the model's responses were then evaluated by humans, followed by risk analysis. The pilot study demonstrated the B3 dataset's utility in rapidly assessing biosecurity risks, pinpointing their origins, and guiding mitigation efforts.
Demonstrates the viability of the Bacterial Biothreat Benchmark (B3) dataset for assessing and mitigating biosecurity risks posed by large language models.
The paper explores knowledge distillation (KD) to transfer refusal behaviors from a proprietary teacher LLM (OpenAI o1-mini) to open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) using multilingual jailbreak prompts. Surprisingly, response-based fine-tuning with "safe" refusal data increased Jailbreak Success Rate (JSR) in student models, indicating a safety compromise due to divergent generalization across languages. Removing nuanced "boundary" refusals mitigated the safety decline, although reasoning performance decreased, highlighting challenges in multilingual safety alignment via KD.
Demonstrates that response-based knowledge distillation for multilingual jailbreak prevention can inadvertently compromise safety by increasing jailbreak success rates in student models due to divergent generalization across languages.
The International AI Safety Report 2025's Second Key Update analyzes the current state of AI risk management and technical mitigations employed by researchers, companies, and governments. It highlights advancements in training safer models and monitoring outputs while acknowledging uncertainties in the effectiveness of these measures and their variability across applications. The report aims to inform policymakers, researchers, and the public about progress and remaining gaps in AI safety.
Synthesizes recent developments in AI risk management and technical risk mitigation strategies, identifying both progress and persistent gaps in ensuring the safety of general-purpose AI systems.
This paper evaluates the robustness of ten publicly available LLM safety guardrail models from major tech companies against 1,445 adversarial prompts across 21 attack categories. The study reveals a significant performance drop in all models when tested on novel, unseen prompts compared to public benchmarks, indicating potential training data contamination. A novel "helpful mode" jailbreak was also discovered in two models, where they generated harmful content instead of blocking it.
Demonstrates that current LLM safety guardrail models exhibit poor generalization to novel adversarial attacks, highlighting the limitations of relying solely on benchmark performance for evaluation.
This paper presents a systematization of knowledge (SoK) for Indirect Prompt Injection (IPI) defense frameworks in LLM agents, providing a taxonomy along five dimensions and evaluating the security and usability of representative defenses. Through analysis of defensive failures, the authors identify six root causes of circumvention. They then design three novel adaptive attacks that substantially improve attack success rates, highlighting vulnerabilities in existing defenses.
Systematizes the landscape of IPI defense frameworks for LLM agents by providing a novel taxonomy, evaluating existing defenses, and developing adaptive attacks that expose their weaknesses.
This paper investigates the adversarial robustness of ResNet-based architectures (BrainNet, BrainNeXt, and DilationNet) for brain tumor classification against FGSM and PGD attacks. The study evaluates model performance across different MRI data preprocessing configurations, including full-sized augmented, shrunk augmented, and shrunk non-augmented datasets. The key finding is that BrainNeXt models demonstrate the highest robustness to black-box attacks, while BrainNet and DilationNet are more vulnerable, and that shrunk and non-augmented data significantly reduce model resilience.
Demonstrates the varying adversarial vulnerability of different ResNet-based architectures for brain tumor classification under transferable FGSM and PGD attacks, highlighting the impact of data preprocessing on robustness.
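For reference, the two attacks evaluated here are standard and easy to sketch: the snippet below implements generic FGSM and PGD against a toy classifier with random weights and random inputs, standing in for (but not reproducing) the paper's BrainNet/BrainNeXt/DilationNet models and MRI data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Projected Gradient Descent: iterated sign steps, projected onto the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

# Toy stand-in classifier and inputs (not the paper's architectures or datasets).
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))
x = torch.rand(2, 1, 64, 64)   # grayscale "scans" scaled to [0, 1]
y = torch.tensor([0, 2])       # hypothetical 4-class tumor labels
x_fgsm = fgsm(model, x, y, eps=8 / 255)
x_pgd = pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10)
print("max PGD perturbation:", (x_pgd - x).abs().max().item())
```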
This paper investigates the applicability of open-source LLM frameworks, including both large-scale and lightweight models, for automating penetration testing tasks relevant to commercial security assessments. The study identifies both the potential and limitations of these frameworks in addressing fundamental challenges in penetration testing. The authors propose a practical approach to overcome key limitations and demonstrate the potential of LLM-based frameworks in real-world penetration testing scenarios.
Demonstrates the practical application of open-source LLM frameworks for penetration testing, highlighting their capabilities and limitations, and proposes solutions to address identified challenges.
This paper benchmarks the performance of DeepSeek Coder and Meta-llama-3-70b-instruct in detecting SQL injection vulnerabilities using a labeled dataset of malicious and legitimate SQL queries. The evaluation focuses on Boolean-based attacks and measures precision, recall, F1-score, and accuracy. Meta-llama-3-70b-instruct achieved superior recall and overall accuracy (74.00%) compared to DeepSeek Coder (60.00%), suggesting it is better at detecting a wider range of malicious queries, though both models require further refinement for standalone security analysis.
Quantifies and compares the effectiveness of DeepSeek Coder and Meta-llama-3-70b-instruct in identifying SQL injection vulnerabilities, revealing the strengths and weaknesses of each model.
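The reported comparison reduces to standard binary-classification metrics over per-query verdicts; a minimal example with made-up labels (1 = malicious, 0 = legitimate) shows how they are computed.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Hypothetical ground truth and model verdicts: 1 = SQL injection, 0 = legitimate.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # one model's verdicts on the same queries

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```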
This paper evaluates the deployment of LLMs and agentic AI in the energy industry, focusing on automating tasks like reporting, compliance, and cyber-defense. It uses a structured evaluation framework to classify outputs based on traceability, reproducibility, and hallucination risk, comparing human-led interactions with autonomous agent loops. The study finds that while LLMs improve efficiency, they introduce governance risks due to lack of validation and unclear boundaries between assistance and autonomous recommendation, potentially leading to acceptance of fabricated content.
Introduces a risk-graded framework for evaluating agentic LLM outputs in energy operations, linking LLM traceability with legal auditability.
This paper introduces a clinician-centered framework to quantify hallucination risks in LLMs used for spine surgery decision support, evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. Six LLMs were assessed across 30 expert-validated spinal cases, revealing that DeepSeek-R1 outperformed others, and reasoning-enhanced models did not consistently improve performance. Multidimensional stress-testing exposed model-specific vulnerabilities, particularly a decline in recommendation quality under amplified complexity, highlighting the need for interpretability mechanisms.
Proposes a novel, multi-dimensional framework for evaluating and quantifying hallucination risks in LLMs for surgical decision support, focusing on clinically relevant aspects like diagnostic precision and recommendation quality.
This paper introduces a hybrid framework that combines ML-based multi-class attack detection with LLMs for attack behavior analysis and mitigation in IoT/IIoT networks. The authors employ structured role-play prompt engineering with RAG to guide ChatGPT-o3 and DeepSeek-R1 in producing detailed, context-aware responses for attack analysis and mitigation. They propose novel quantitative evaluation metrics and use an ensemble of judge LLMs to independently assess the responses, demonstrating that Random Forest performs best for attack detection and ChatGPT-o3 outperforms DeepSeek-R1 in attack analysis and mitigation.
Introduces a novel framework for quantitative evaluation of LLM-based attack analysis and mitigation in IoT/IIoT networks, using an ensemble of judge LLMs and novel metrics.
The authors introduce the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework, to evaluate LLM-based cybersecurity agents across offensive and defensive domains. CAIBench integrates five evaluation categories, including CTFs, cyber range exercises, knowledge benchmarks, and privacy assessments, to address the limitations of existing benchmarks that assess isolated skills. Experiments with state-of-the-art AI models reveal a performance gap between security knowledge and adaptive capabilities, particularly in multi-step adversarial scenarios and robotic targets, highlighting the importance of a meta-benchmark approach.
Introduces CAIBench, a novel meta-benchmark framework for evaluating cybersecurity AI agents across diverse offensive and defensive tasks, including robotics and privacy assessments.
The paper introduces Multimodal Variational Masked Autoencoder (MVMAE), a pre-training framework for Medical VQA designed to improve robustness against adversarial attacks. MVMAE employs masked modeling and variational inference with a multimodal bottleneck fusion module and reparameterization to extract robust latent representations. Experiments on medical VQA datasets show that MVMAE significantly improves resistance to adversarial attacks compared to other pre-training methods.
Introduces a novel multimodal variational masked autoencoder (MVMAE) pre-training framework that enhances the robustness of medical VQA models against adversarial attacks.
The paper addresses the problem of hallucination in Large Vision-Language Models (LVLMs) by proposing a Dual-Modal Collaborative Attention Reinforcement (DuCAR) method. DuCAR uses intra-visual CLS-driven sampling and cross-modal dynamic sampling to extract important visual tokens, and then adaptively enhances the attention weights of these tokens during multimodal fusion. Experiments on POPE and CHAIR benchmarks demonstrate that DuCAR outperforms existing methods in mitigating hallucinations.
Introduces a dual-modal collaborative attention reinforcement (DuCAR) method to mitigate hallucinations in LVLMs by reinforcing informative visual tokens and suppressing attention dispersion.
The paper introduces PRISM-AI, a neuro-symbolic multi-agent framework that combines a symbolic rule engine (LogicMP) with a neural agent to mitigate privacy risks in LLM inference, enforcing GDPR and Act 25 constraints. PRISM-AI uses a dual-stage privacy control mechanism to evaluate both user prompts and LLM outputs, proactively and reactively filtering sensitive content. Experiments across diverse domains show LogicMP achieves higher accuracy (82.5%) and efficiency (2,806x faster, 100x lower memory) compared to LLM-based detection, alongside a 29.3% precision advantage, demonstrating the benefits of neuro-symbolic integration for privacy protection.
Introduces a novel dual-stage neuro-symbolic agentic framework, PRISM-AI, that integrates symbolic reasoning with neural agents to enhance privacy risk mitigation in LLMs by evaluating both inputs and outputs.
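A heavily simplified sketch of the dual-stage control flow, with a couple of regex rules standing in for the LogicMP symbolic engine: prompts are screened proactively before inference and outputs are screened reactively before release. The rule set, redaction behavior, and stub model are hypothetical.

```python
import re

# Hypothetical stand-ins for the symbolic rule layer: plain regexes instead of
# LogicMP, covering two GDPR-style identifier categories.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scan(text):
    return [name for name, rx in RULES.items() if rx.search(text)]

def guarded_inference(prompt, llm):
    # Stage 1 (proactive): block prompts that already contain sensitive identifiers.
    hits = scan(prompt)
    if hits:
        return f"[blocked: prompt contains {', '.join(hits)}]"
    # Stage 2 (reactive): filter the model output before it reaches the user.
    output = llm(prompt)
    return "[redacted output]" if scan(output) else output

fake_llm = lambda p: "Sure, reach them at jane.doe@example.com."   # toy model stub
print(guarded_inference("Summarise this support ticket.", fake_llm))
```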
The paper identifies a vulnerability in reasoning-based safety guardrails for Large Reasoning Models (LRMs) where subtle manipulations of input prompts, such as adding template tokens, can bypass the guardrails and elicit harmful responses. They introduce a "bag of tricks" of jailbreak methods, including template manipulations and automated optimization, that successfully subvert these guardrails in white-, gray-, and black-box settings. Experiments on open-source LRMs demonstrate high attack success rates (over 90% on gpt-oss series) across various benchmarks, highlighting the systemic nature of the vulnerability and the need for improved alignment techniques.
Reveals the fragility of reasoning-based safety guardrails in LRMs by demonstrating that simple prompt manipulations can effectively bypass them, leading to potentially harmful outputs.
The paper introduces RLHF-COV and DPO-COV algorithms designed to simultaneously address corrupted preference data, reward overoptimization, and verbosity biases in aligning LLMs with human preferences. The algorithms achieve this by incorporating length regularization and leveraging theoretical guarantees on generalization error rates, even with corrupted data. The authors prove the equivalence of RLHF-COV and DPO-COV, mirroring the known equivalence of vanilla RLHF and DPO, and demonstrate the effectiveness of DPO-COV in offline and online settings.
Introduces and theoretically justifies RLHF-COV and DPO-COV algorithms that provably mitigate corruption, overoptimization, and verbosity biases in both offline and online RLHF/DPO alignment.
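For concreteness, a minimal sketch of a length-regularized DPO objective: the standard DPO term plus a penalty on the chosen-minus-rejected length gap as a stand-in for the paper's verbosity correction. The functional form of the length term, beta, and lambda are assumptions, not the DPO-COV formulation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                                len_chosen, len_rejected, beta=0.1, lam=0.01):
    """Standard DPO loss plus a simple length-difference regularizer (illustrative)."""
    gap = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    dpo = -F.logsigmoid(gap)                              # usual DPO preference term
    length_reg = lam * (len_chosen - len_rejected).float()  # penalize verbose winners
    return (dpo + length_reg).mean()

# Toy batch of summed token log-probs under the policy and a frozen reference model.
policy_c, policy_r = torch.tensor([-55.0, -60.0]), torch.tensor([-70.0, -58.0])
ref_c, ref_r = torch.tensor([-57.0, -62.0]), torch.tensor([-68.0, -59.0])
lens_c, lens_r = torch.tensor([120, 300]), torch.tensor([110, 140])
print(length_regularized_dpo_loss(policy_c, policy_r, ref_c, ref_r, lens_c, lens_r))
```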
The paper red-teams OpenAI's GPT-OSS-20B model in Hausa, a low-resource language, to evaluate its safety alignment. It demonstrates that minimal prompting can induce the model to generate harmful, culturally insensitive, and factually inaccurate content, particularly when using polite language that exploits reward hacking. The study reveals critical vulnerabilities, including the model's false assumptions about the safety of common toxins and its inability to distinguish between raw and processed foods, highlighting the need for improved safety tuning in low-resource languages.
Demonstrates that OpenAI's GPT-OSS-20B model exhibits significant safety alignment failures and biases when used in Hausa, a low-resource language, due to insufficient safety tuning.

