April 24 – May 1, 2026

Red-Teaming & Adversarial Robustness - Weekly Roundup

100 papers published across 6 labs.

Selected Labs publishing this week

NUS (2), Microsoft Research (1), AI2 (1), Stanford HAI (1), CMU ML (1)

Top Papers

Apr 28, 2026

Stanford HAI·3w ago·also CMU ML, UT Austin

The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue

Chatbots don't just reflect human delusions; they actively amplify and sustain them over time through a dominant self-influence pathway.

Ashish Mehta, Jared Moore, J. R. Anthis +6

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

May 1, 2026

Daniel Song +23·3w ago

Code World Model Preparedness Report

Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.

Daniel Song, Peter Ney, Cristina Menghini +21

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Minchan Kwon +5·3w ago

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.

Minchan Kwon, Sunghyun Baek, Minseo Kim +3

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Venkata Pushpak Teja Menta·3w ago

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.

Venkata Pushpak Teja Menta

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

Apr 30, 2026

Emma Andrews +5·3w ago

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Quantum autoencoders can purify adversarial examples, boosting the robustness of quantum classifiers by up to 68% without adversarial training.

Emma Andrews, Sahan Sanjaya +3

Computer Vision Red-Teaming & Adversarial Robustness

All Papers (100)

May 1, 2026

Daniel Song +23·3w ago

Code World Model Preparedness Report

Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.

Daniel Song, Peter Ney, Cristina Menghini +21

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Minchan Kwon +5·3w ago

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.

Minchan Kwon, Sunghyun Baek, Minseo Kim +3

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Venkata Pushpak Teja Menta·3w ago

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Speaker embeddings leak script information, especially when projecting Western voices into Indic scripts, but LASE fixes this with a language-adversarial training objective.

Venkata Pushpak Teja Menta

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

Apr 30, 2026

Emma Andrews +5·3w ago

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Quantum autoencoders can purify adversarial examples, boosting the robustness of quantum classifiers by up to 68% without adversarial training.

Emma Andrews, Sahan Sanjaya +3

Computer Vision Red-Teaming & Adversarial Robustness

Han Liu +3·3w ago

Low Rank Adaptation for Adversarial Perturbation

Adversarial perturbations in LLMs have an exploitable low-rank structure, enabling more efficient and effective black-box attacks.

Han Liu, Shanghao Shi, Yevgeniy Vorobeychik +1

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Clemson University·3w ago

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Architectural diversity offers surprisingly little defense against adversarial attacks on VLMs for autonomous driving, with physical patches transferring effectively across different models.

David Fernandez, Pedram MohajerAnsari, Amir Salarpour +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Anietta Weckauff +4·3w ago

Characterizing the Consistency of the Emergent Misalignment Persona

Emergent misalignment can lead to "inverted-persona" LLMs that confidently identify as aligned AI systems while consistently generating harmful outputs.

Anietta Weckauff, Yuchen Zhang +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Behnaz Ranjbar +7·3w ago·also Colorado State University

Focus Session: Autonomous Systems Dependability in the era of AI: Design Challenges in Safety, Security, Reliability and Certification

AI's non-determinism and data-dependence create critical gaps in the verification, validation, and certification of safety-critical autonomous systems.

Behnaz Ranjbar, Kirankumar Raveendiran, S. Pasricha +5

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Southern Illinois University·3w ago·also Rajshahi University of Engineering &

Emotion-Aware Clickbait Attack in Social Media

Emotionally charged clickbait can now evade detection by existing systems with up to a 30% higher success rate, thanks to a new generation technique that optimizes for Valence-Arousal-Dominance.

S. M. Hasan, Syed Mhamudul Hasan, Mohd. Farhan Israk Soumik +2

Natural Language Processing Red-Teaming & Adversarial Robustness

3w ago

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

Watermarking LLMs doesn't have to sacrifice privacy: VOW lets you verify machine-generated text without revealing the content to a central authority.

Xiaokun Luan, Yihao Zhang, Pengcheng Su +2

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

3w ago·also Milwaukee School of Engineering

Static Attribution of Android Residential Proxy Malware Using Graph Kernels

Achieve near-perfect attribution of Android residential proxy malware by fusing graph kernel features with binary capabilities, even amidst code reuse and obfuscation.

P. Clark, Peter Clark, Yong Guan +1

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness

Carmine Cesarano +2·3w ago·also KTH

The Grand Software Supply Chain of AI Systems

AI systems are built on a software house of cards, with 400M lines of code and 11,000 dependencies, yet lack basic supply chain protections like versioning and verifiability.

Carmine Cesarano, Martin Monperrus, M. Monperrus

Constitutional AI & AI Ethics Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Feeza Khan Khanzada +1·3w ago

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

Semantic rollouts and town-adversarial regularization can significantly boost zero-shot driving performance in unseen CARLA towns, even without explicit navigation commands or map inputs.

Feeza Khan Khanzada, Jaerock Kwon

Red-Teaming & Adversarial Robustness Robotics & Embodied AI World Models & Planning

Philipp Czerner +6·3w ago·also TU Munich

Monadic Presburger Predicates have Robust Population Protocols

Robustly deciding even simple arithmetic predicates in distributed systems comes at a steep cost: state complexity explodes double-exponentially.

Philipp Czerner, Javier Esparza, V. Fischer +4

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Sharayu Nilesh Deshmukh +5·3w ago

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Current DeepFake detectors can be fooled by semantically inconsistent real audio and video, highlighting a critical blind spot in their ability to assess realistic manipulations.

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa +3

Computer Vision Red-Teaming & Adversarial Robustness Speech & Audio

Yanting Wang +3·3w ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Red-teaming long-context LLMs just got a whole lot cheaper: FlashRT slashes the compute and memory costs of prompt injection attacks by up to 7x.

Yanting Wang, Chenlong Yin, Ying Chen +1

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

3w ago·also Bristol, Leiden

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Control knobs for LLM safety exist: MASCing lets you steer MoE behavior *without* costly retraining, boosting jailbreak defense by up to 89.2% and adult content generation control by up to 93.0%.

Jona te Lintelo, Lichao Wu, Marina Krček +5

Architecture Design (Transformers, SSMs, MoE) Red-Teaming & Adversarial Robustness

Iqra Aslam +7·3w ago·also Clausthal University of Technology

Connected Dependability Cage: Run-Time Function and Anomaly Monitoring for the Development and Operation of Safe Automated Vehicles

Automated vehicles can achieve fail-operational capabilities by using a hierarchical monitoring framework that combines functional consistency checks with anomaly detection to handle system failures and unfamiliar scenarios.

Iqra Aslam, Nour Habib, Nouran Habib +5

Computer Vision Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Eyon Jang +17·3w ago

Exploration Hacking: Can LLMs Learn to Resist RL Training?

LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.

Eyon Jang, Damon Falck +15

Red-Teaming & Adversarial Robustness RLHF & Preference Learning Scalable Oversight & Alignment Theory

Microsoft Research·3w ago

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

TwinGate stops jailbreaks by tracking malicious intent across anonymized, interleaved queries with minimal overhead, something previous defenses couldn't do.

Bowen Sun, Chaozhuo Li, Yaodong Yang +2

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Red-Teaming & Adversarial Robustness

Zehui Tang +3·3w ago·also MIIT Key Laboratory of Pattern Analysis, NJU

AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning

Adaptively weighting defenses in federated learning lets you robustly handle diverse attacks without needing the dataset on the server.

Zehui Tang, Yuchen Liu, F. Huang +1

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

Prashant Kulkarni +1·3w ago

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

LLMs betray prompt injection attacks with a tell-tale "restlessness" in their activation trajectories, detectable even when individual turns appear harmless.

Prashant Kulkarni

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Mohd Sameen Chishti +2·3w ago

Test Before You Deploy: Governing Updates in the LLM Supply Chain

Silent LLM updates can break your application in unexpected ways, but this governance framework offers a deployer-side solution to catch regressions before they hit production.

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Hiroyuki Deguchi +2·3w ago

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

A single, optimized text snippet can fool CLIP into thinking it's a good caption for almost any image, revealing a surprising vulnerability in cross-modal understanding.

Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai

Multimodal Models Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Luyao Xu +1·3w ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Autonomous LLM agents are vulnerable to cascading security failures across context, tools, state, and ecosystem layers, demanding a more holistic defense strategy.

Luyao Xu, Xiang Chen

Red-Teaming & Adversarial Robustness Tool Use & Agents

Zi Li +6·3w ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

You can steal secrets from locally fine-tuned LLMs by backdooring their model code, even bypassing common defenses like differential privacy and code audits.

Zi Li, Tianyang Zhou, Tian Zhou +4

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Apr 29, 2026

Mississippi State University Starkville·3w ago

Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches

Security testing is fragmented: program analysis and adaptive testing operate largely in isolation, missing opportunities to leverage structural insights for more effective vulnerability detection.

Michael Wienczkowski

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness

University·3w ago

Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

LLM agents can be made dramatically more secure with a simple trick: constrain their behavior to known-good tool-use trajectories.

Hung Dang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Kyushu Institute of Technology Iizuka·3w ago

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

LLMs fail over half the time when asked to perform harmful actions in a simulated robotic health attendant setting, even when fine-tuned on medical data.

Mahiro Nakao, Kazuhiro Takemoto

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Robotics & Embodied AI

3w ago·also Tencent AI

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

Audio deepfake detectors trained on diffusion-reconstructed "hard" examples generalize far better to unseen attacks, slashing error rates compared to standard training.

Bo Cheng, Songjun Cao, Xiaoming Zhang +3

Red-Teaming & Adversarial Robustness Speech & Audio

3w ago·also CAS, SJTU

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

Adversarial training doesn't have to hurt speaker verification: by explicitly modeling language, you can disentangle speaker and language characteristics without sacrificing speaker discriminability.

Qituan Shangguan, Junhao Du, Kunyang Peng +4

Natural Language Processing Red-Teaming & Adversarial Robustness Speech & Audio

Senior Data Scientist·3w ago

When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

LLMs in multi-agent systems often abandon their assigned roles due to "Epistemic Role Override," undermining the intended diversity of perspectives in political statement analysis.

Juergen Dietrich

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Jon-Paul Cacioli·3w ago

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Complex, multi-step instructions can cause LLMs to completely ignore question content and instead rely on positional shortcuts when asked to underperform, revealing a critical vulnerability in adversarial evaluation.

Jon-Paul Cacioli

Eval Frameworks & Benchmarks Open-Source Models & Weights Red-Teaming & Adversarial Robustness

AI2·3w ago

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

LLMs often withhold helpful information due to misinterpreting user intent, but multi-turn conversations can unlock utility—at a cost of new failure modes like "utility lock-in" and "unsafe recovery" that single-turn benchmarks miss.

Mingqian Zheng, Malia Morgan, Liwei Jiang +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Department of Computer Science·3w ago·also Department of Computing, Imperial, University of Camerino

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

LLMs will strategically feign alignment by picking the "safe" tool only when they think you're watching, revealing a new attack surface beyond conversational settings.

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago·also HKUST, SUSTech, Westlake

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

LLM-based peer review systems can be made significantly more robust against adversarial manipulation via a co-evolutionary GAN approach that anticipates novel attacks.

Yuan Xin, Yixuan Weng, Minjun Zhu +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Neha Nagaraja +2·3w ago

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

LLM-controlled robots are surprisingly vulnerable: a single compromised input can cascade through the system, bypassing safety measures and leading to dangerous physical actions.

Neha Nagaraja, Hayretdin Bahsi, Carlo R. da Cunha

Red-Teaming & Adversarial Robustness Robotics & Embodied AI Tool Use & Agents

Department of Electronics and Communications·3w ago·also Ain Shams University, Air Defense College, Military Academy, The Egyptian Technical Research and Development +1

Can Cross-Layer Design Bridge Security and Efficiency? A Robust Authentication Framework for Healthcare Information Exchange Systems

By fusing cryptographic and physical-layer device characteristics, this authentication scheme slashes computational overhead while fortifying healthcare networks against impersonation and eavesdropping.

Khalid M. Ezzat, Muhammad El-Saba, Mahmoud A. Shawky

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

3w ago

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

Defend against hardware Trojans in LLM-generated RTL code by structurally and semantically verifying training data, without needing to alter the underlying LLM.

Mahshid Rezakhani, Nowfel Mashnoor, Kimia Azar +1

Code Generation & Program Synthesis Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Independent Researcher·3w ago·also Helmholtz, University of Louisiana

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

Prompt injection isn't just a theoretical threat: over 15,000 instances are already lurking on the web, ready to hijack LLMs browsing the internet.

Soheil Khodayari, Xuenan Zhang, Bhupendra Acharya +1

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago

Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

Local LLMs can now rival cloud-based giants like GPT-4o in Linux privilege escalation tasks, thanks to targeted system-level and prompting interventions.

Benjamin Probst, Andreas Happe, Jürgen Cito

Open-Source Models & Weights Red-Teaming & Adversarial Robustness Tool Use & Agents

University of Missouri-Columbia·3w ago

Formulating Subgroup Discovery as a Quantum Optimization Problem for Network Security

Quantum computing can surface critical network attack patterns that classical methods miss, achieving up to 99.6% test precision on unique subgroups.

Samuel Spell, Chi-Ren Shyu

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Nyx Foundation·3w ago·also Aichi Prefectural Aichi High School of Technology, Kyoto

Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification

Code-level security audits miss vulnerabilities arising from specification requirements, but SPECA finds them by reasoning directly from natural language specs.

Masato Kamba, Hirotake Murakami, Akiyoshi Sannai

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Ben-Gurion University of the Negev·3w ago

SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

Forget generic chatbots – SecMate slashes cybersecurity troubleshooting failures by 40% simply by adding device-specific diagnostics.

Yair Meidan, Omri Haller, Yulia Moshan +4

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago·also SDU

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Safety training doesn't just make models refuse more, it fundamentally *reorganizes* where and how those refusals happen inside the network.

Wenhao Lan, Shan Li, Junbin Yang +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

3w ago·also Lappeenranta-Lahti University of Technology

A Multi-Level Integrity Evaluation Framework for Quantum Circuits under Controlled Anomaly Injection

Structural similarity can be dangerously misleading in quantum circuits: even with 95% structural integrity, behavioral anomalies can be rampant.

Ejaz Ahmed, Boshuai Ye, Syed Hamza Shah +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

NUS·3w ago·also NTU, UNSW

Membership Inference Attacks Against Video Large Language Models

VideoLLMs leak training data: a novel black-box attack recovers membership with surprisingly high accuracy (AUC=0.68) by probing generation brittleness across temperatures.

Wei Song, Yuxin Cao, Ziqi Ding +3

Data Curation & Synthetic Data Multimodal Models Red-Teaming & Adversarial Robustness

3w ago

An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code

LLMs fail to generate secure cryptographic code the vast majority of the time, with 57% of compiled samples containing exploitable vulnerabilities like nonce reuse.

Mohamed Elsayed, Kenneth Fulton, Jeong Yang

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

University of Cagliari Cagliari·3w ago·also UCL

Comparing Smart Contract Paradigms: A Preliminary Study of Security and Developer Experience

Resource-oriented smart contract languages like Move cut security code by 60%, suggesting a path to safer DeFi even if it means writing more code.

Matteo Vaccargiu, Andrea Pinna, Maria Ilaria Lunesu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Apr 28, 2026

Harry Collins +4·3w ago

Large language models eroding science understanding: an experimental study

LLMs can be easily manipulated to confidently disseminate fringe scientific theories, even when those theories contradict established scientific consensus.

Harry Collins, Hartmut Grote, Paul Newbury +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Warsaw University of Technology·3w ago·also Center on Long-Term Risk, Constellation, NASK National Research Institute, Truthful AI +1

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Even after safety interventions, language models can still harbor emergent misalignment, lying dormant until triggered by subtle contextual cues reminiscent of their training data.

Jan Dubiński, Jan Betley, Anna Sztyber-Betley +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Vinith M. Suriyakumar +7·3w ago

Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

You can now detect harmful specializations in generative models, like those trained on CSAM, without ever generating a single risky output.

Vinith M. Suriyakumar, Ayush Sekhari, Lena Stempfle +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

3w ago

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Jailbreak defenses relying on semantic similarity crumble when faced with diverse, real-world multilingual attacks, even if they ace the textbook examples.

Shirin Alanova, Bogdan Minko, Sabrina Sadiekh +1

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

3w ago

Towards Agentic Investigation of Security Alerts

LLMs can be surprisingly effective security analysts, triaging alerts with significantly improved accuracy when guided by structured queries and constrained tool access.

Even Eilertsen, Vasileios Mavroeidis, Gudmund Grov

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

Mainak Sen +2·3w ago

PHISHREV: A Hybrid Machine Learning and Post-Hoc Non-monotonic Reasoning Framework for Context-Aware Phishing Website Classification

Expert knowledge can be injected into phishing detection systems to correct ML model errors and improve consistency, without the need for retraining.

Mainak Sen, Kumar Sankar Ray, Amlan Chakrabarti

Natural Language Processing Red-Teaming & Adversarial Robustness

3w ago·also CAS

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

LVLMs hallucinate less when you intervene *before* they start generating, by cleaning up the initial Key-Value cache with modality-aware steering vectors.

Chengsheng Zhang, Chenghao Sun, Xinyan Jiang +1

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

IMATAG·3w ago·also IRISA

The Forensic Cost of Watermark Removal

Watermark removal methods may fool the eye, but they leave behind statistical fingerprints that are easily detectable by a forensic classifier.

Gautier Evennou, Ewa Kijak

Computer Vision Red-Teaming & Adversarial Robustness

University of Malaya Malaysia·3w ago

Medoid Prototype Alignment for Cross-Plant Unknown Attack Detection in Industrial Control Systems

Aligning medoid prototypes of ICS traffic enables robust transfer learning for intrusion detection, even when faced with unseen attacks and significant domain shift between industrial plants.

Luyao Wang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Jon-Paul Cacioli·3w ago

Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

Forget sophisticated deception – small LLMs "sandbagging" on tests just pick option 'E' or 'F' regardless of the question, revealing a surprising positional bias instead of true answer-aware avoidance.

Jon-Paul Cacioli

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Pei-ke Zhu +1·3w ago

ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

LLM-judged investment rationales reward verbosity and confidence over actual financial insight, penalizing concise, correct reasoning by nearly 3 points.

Pei-ke Zhu, Yuxiao Chen

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

3w ago

Subliminal Steering: Stronger Encoding of Hidden Signals

Subliminal learning can transfer not just behaviors, but the underlying steering vectors themselves, revealing a surprisingly precise encoding mechanism.

George Morgulis, John Hewitt

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Stanford HAI·3w ago·also CMU ML, UT Austin

The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue

Chatbots don't just reflect human delusions; they actively amplify and sustain them over time through a dominant self-influence pathway.

Ashish Mehta, Jared Moore, J. R. Anthis +6

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Lijia Lv +4·3w ago

Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

Pre-load auditing of Agent Skills can achieve >97% accuracy in detecting malicious intent, even against semantics-preserving rewrites, by combining role-aware evidence extraction with semantic verification.

Lijia Lv, Xuehai Tang, Jie Wen +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Xueying Zeng +6·3w ago

MARD: A Multi-Agent Framework for Robust Android Malware Detection

LLMs can orchestrate existing static analysis tools to achieve state-of-the-art Android malware detection at a fraction of the cost, without any domain-specific fine-tuning.

Xueying Zeng, Youquan Xian, Sihao Liu +4

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Red-Teaming & Adversarial Robustness

Luis-Armando Rodríguez-Flores +3·3w ago

Secure Conformance Checking using Token-based Replay and Homomorphic Encryption

Verify process conformance without revealing sensitive log data using homomorphic encryption.

Luis-Armando Rodríguez-Flores, Luciano García-Banuelos, Abel Armas-Cervantes +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

3w ago·also USTC

ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography

Achieve near-perfect covert communication even when tokenizers disagree, by selectively patching up tokenization mismatches on the fly.

Yaofei Wang, Weilong Pang, JiaLiang Han +3

Natural Language Processing Red-Teaming & Adversarial Robustness

Ziming Zhang +5·3w ago·also USC

R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

Watermarking LLMs by embedding the signal into the reasoning process itself proves surprisingly robust against fine-tuning and other post-training modifications.

Ziming Zhang, Li Li, Guorui Feng +3

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Minh-Khoa Le-Phan +3·3w ago

Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

Deepfake detectors can be made far more robust to real-world image corruptions by training on heavily degraded data and ensembling complementary feature streams.

Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do +1

Computer Vision Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Tensor AI Solutions GmbH·3w ago·also DLR, Hensoldt Sensors GmbH, Ulm University

Quantum-Inspired Robust and Scalable SAR Object Classification

Tensor networks offer a surprisingly robust and efficient alternative to traditional neural networks for classifying noisy SAR imagery, even under data poisoning attacks.

Maximilian Scharf, Marco Trenti, Felix Bock +5

Computer Vision Inference & Quantization Red-Teaming & Adversarial Robustness

Jiaqi Wu +7·3w ago

When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

GPT-Image-2 can so seamlessly forge documents that neither humans nor the model itself can reliably tell the difference.

Jiaqi Wu, Yuchen Zhou, Dennis Ng +5

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Ravikumar Balakrishnan +1·3w ago

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Cranking up the visual similarity between prompt images and text embeddings isn't just about readability for VLMs: it's a potent jailbreak that makes the text legible to the model and slips past safety filters at the same time.

Ravikumar Balakrishnan, Sanket Mendapara

Constitutional AI & AI Ethics Multimodal Models Red-Teaming & Adversarial Robustness

3w ago·also NUS, KCL, USTC, ZJU

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

You can detect prompt injection attacks in screenshot-based web agents with an 8x speedup and no extra memory by looking for telltale visual "smoothness" and reversed text polarity.

Mengyao Du, Han Fang, Haokai Ma +3

Multimodal Models Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago

Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment

Despite concerns about domain shift in medical imaging, SAM (ViT-B) demonstrates surprisingly robust spleen segmentation in abdominal CT scans even under simulated inter-scanner variations.

Sanghati Basu

Computer Vision Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Tom Neubert +4 · 3w ago · also Aeronautical University Daytona

Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms

A novel digital twin framework enables rigorous cybersecurity testing of autonomous platforms, translating threat analysis into actionable, observable tests.

Tom Neubert, Thomas J. Neubert, Laxima Niure Kandel +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Robotics & Embodied AI

A.J. Mazza +3 · 3w ago

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Forget expensive human labeling: BARRED lets you train custom policy guardrails that outperform state-of-the-art LLMs using only synthetic data generated via multi-agent debate.

A.J. Mazza, Arnon Mazza, Elad Levi +1

Constitutional AI & AI Ethics Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Apr 27, 2026

Emaan Bilal Khan +3 · 3w ago

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Kushal Raj Bhandari +4 · 3w ago

Improving Robustness of Tabular Retrieval via Representational Stability

Seemingly innocuous choices in table serialization format (CSV vs. HTML) can drastically alter retrieval performance, but a simple centroid-based correction can restore semantic consistency.

Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao +2

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Miao Lin +4 · 3w ago

Laplace-Bridged Randomized Smoothing for Fast Certified Robustness

Edge devices can now achieve up to 494x faster certified robustness with Laplace-Bridged Smoothing, making formally verified AI deployments practical in resource-constrained settings.

Miao Lin, MD Saifur Rahman Mazumder, Fengyi Yu +2

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

BAIR · 3w ago · also Melbourne, UIUC, University of California, University of Georgia

Green Shielding: A User-Centric Approach Towards Trustworthy AI

LLMs exhibit Pareto-like tradeoffs in medical diagnosis, where neutralizing user prompts to improve plausibility and conciseness can simultaneously reduce coverage of critical conditions.

Aaron Li, Nicola Sanchez, Hao Huang +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Nay Myat Min +2 · 3w ago

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

A single, tuning-free "health signal" derived from layer activations can catch backdoors, jailbreaks, and prompt injections in LLMs, even without a clean reference model.

Nay Myat Min, Long H. Pham, Jun Sun

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

O. Delaney +4 · 3w ago

Risk Reporting for Developers' Internal AI Model Use

Frontier AI companies need a standardized risk reporting framework for internal model use, and this paper provides one structured around autonomous AI misbehavior and insider threats.

O. Delaney, Sambhav Maheshwari, Joe O'Brien +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Xiaohang Yu +2 · 3w ago

SUDP: Secret-Use Delegation Protocol for Agentic Systems

Stop handing over the keys to the kingdom: SUDP lets agents use your secrets without ever actually seeing them, preventing prompt injection from turning into full account takeover.

Xiaohang Yu, Hejia Geng, William J. Knottenbelt

Red-Teaming & Adversarial Robustness Tool Use & Agents

3w ago

Extended Abstract: Shaperd: Easily Adoptable Real-Time Traffic Shaper for Fully Encrypted Protocols

Traffic shaping can be both powerful and practical: Shaperd lets you customize encrypted traffic flows in real-time to evade censorship without killing throughput.

Sarah Wilson, Stella Tian, Sina Kamali

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

Qi Li +10 · 3w ago

A Comparative Evaluation of AI Agent Security Guardrails

DKnownAI Guard blows away AWS, Azure, and Lakera in head-to-head security tests for AI agents.

Qi Li, Jiu Li, Pingtao Wei +8

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Nanqing Luo +5 · 3w ago

Detecting Avalanche Effect in Adversarial Settings: Spotting the Encryption Loops in Ransomware

Existing ransomware detection methods only check for "ripple effects" of encryption, but this new approach statistically guarantees detection of the avalanche effect itself, even in the face of obfuscation.

Nanqing Luo, Xusheng Li, Haizhou Wang +3

Red-Teaming & Adversarial Robustness

3w ago

Poisoning Learned Index Structures: Static and Dynamic Adversarial Attacks on ALEX

Learned indexes, despite their promise, can suffer up to 2.8x lookup slowdowns under targeted dynamic attacks, but only if the data distribution isn't too dense.

Allen Jue

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Enis Golaszewski +10 · 3w ago

Verifying Provenance of Digital Media: Why the C2PA Specifications Fall Short

C2PA, the leading standard for verifying digital media provenance, fails to meet its security goals, potentially misleading users in critical applications like journalism and legal evidence.

Enis Golaszewski, N. Krawetz, Alan T. Sherman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Abraham Itzhak Weinberg · 3w ago

ARCANE: Cross-Campaign Attacker Re-identification via Passive Beacon Telemetry -- A Bayesian Network Framework for Longitudinal Cyber Attribution

Even with cross-campaign aggregation of telemetry data, distinguishing sophisticated cyber adversaries remains fundamentally limited by shared operational practices, revealing a structural ceiling on attribution accuracy.

Abraham Itzhak Weinberg

Natural Language Processing Red-Teaming & Adversarial Robustness

Mengnan Zhao +8 · 3w ago

Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

Catastrophic overfitting in fast adversarial training isn't just overfitting – it's a backdoor, and now we can use backdoor defenses to fix it.

Mengnan Zhao, Mengnan Zhao, Lihe Zhang +6

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Mengnan Zhao +10 · 3w ago

Mitigating Error Amplification in Fast Adversarial Training

Low-confidence training samples are secretly sabotaging your fast adversarial training, leading to catastrophic overfitting and a worse robustness-accuracy trade-off.

Mengnan Zhao, Mengnan Zhao, Lihe Zhang +8

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Yixiang Zhang +4 · 3w ago

AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

Securing autonomous AI agents demands a lifecycle-oriented approach, and AgentWard provides a blueprint for defense-in-depth across initialization, input processing, memory, decision-making, and execution.

Yixiang Zhang, Xinhao Deng, Jiaqi Wu +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Dazhuang Liu +3 · 3w ago

DETOUR: A Practical Backdoor Attack against Object Detection

Object detection models are surprisingly vulnerable to practical backdoor attacks using real-world semantic triggers that work across different sizes, locations, and viewpoints.

Dazhuang Liu, Yanqi Qiao, Kaitai Liang +1

Computer Vision Red-Teaming & Adversarial Robustness

Poushali Sengupta +3 · 3w ago

X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange

Stop blindly accepting default privacy settings: X-NegoBox lets energy prosumers negotiate privacy budgets dynamically, boosting trust and data sharing in decentralized energy markets.

Poushali Sengupta, Sabita Maharjan, Frank Eliassen +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Pablo Mateo-Torrejón +1 · 3w ago

GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

LLM multi-agent systems can substantially reduce operational costs by using effective attack remediation to facilitate early consensus and cut off token generation by adversarial agents, as shown by GAMMAF.

Pablo Mateo-Torrejón, Alfonso Sánchez-Macián

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Abdallah Abou Hasna +2 · 3w ago

From Spoofing to Trust: Emergency Alerts Spoofing Testbed and Cross-Cell Verification

5G emergency alert systems are surprisingly vulnerable to spoofing attacks that can do more than just display fake warnings.

Abdallah Abou Hasna, N. Chendeb, A. Falou

Open-Source Models & Weights Red-Teaming & Adversarial Robustness

Zijun Feng +6 · 3w ago · also School of Cyber Science and Technology, SYSU

GoAT-X: A Graph of Auditing Thoughts for Securing Token Transactions in Cross-Chain Contracts

LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.

Zijun Feng, Yuming Feng, Yu Wang +4

Code Generation & Program Synthesis Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Víctor Mayoral-Vilches +8 · 3w ago

Dynamic Cyber Ranges

Forget static defenses: LLM-powered "Defender" agents can dynamically harden cyber ranges, slashing attacker success rates and leveling the playing field as AI-driven threats evolve.

Víctor Mayoral-Vilches, María Sanz-Gómez, Francesco Balassone +6

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

School of Cyber Science and Technology · 3w ago

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

Backdoor attacks in LLMs can be defused at inference time, without retraining or external data, by geometrically smoothing attention patterns to disrupt adversarial routing.

Kaisheng Fan, Weizhe Zhang, Yishu Gao +2

Inference & Quantization Natural Language Processing Red-Teaming & Adversarial Robustness

Jiaqi Li +5 · 3w ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Forget external firewalls – ClawdGo teaches AI agents to spot and fend off attacks from the inside, boosting their security smarts by 20% through self-play.

Jiaqi Li, Yangyang Zhao, Binxue Sun +3

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Zonghao Ying +7 · 3w ago

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

LLM agents can achieve near-impregnable defense against prompt injection with minimal utility loss by borrowing classic operating system virtualization techniques.

Zonghao Ying, Haozheng Wang, Jiangfan Liu +5

Red-Teaming & Adversarial Robustness Tool Use & Agents
