March 11 – March 18, 2026

Red-Teaming & Adversarial Robustness - Weekly Roundup

100 papers published across 4 labs.

2% acceleration

Selected Labs publishing this week

Amazon Science1 Stanford HAI1 CMU ML1 NVIDIA1

Top Papers

Mar 18, 2026

Priyaranjan Pattnayak +12w ago

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.

Priyaranjan Pattnayak, Sanchari Chowdhuri

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Mar 16, 2026

University of Zanjan2w ago·also Brandenburg Technical University, Brandenburgische Technische Universität Cottbus, Leibniz, Tallinn University of Technology

RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks

Quantizing neural networks doesn't have to mean sacrificing robustness: a new three-stage framework achieves up to 10.35% better attack resilience and 12.47% better fault resilience.

Ali Mohammadi, Ali Soltan Mohammadi, Samira Nazari +8

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Mar 18, 2026

Julia Jose +22w ago

Large-Scale Analysis of Political Propaganda on Moltbook

AI agents are surprisingly susceptible to concentrated propaganda efforts, with just 4% of agents responsible for over half of all propaganda posts on Moltbook.

Julia Jose, M. Nair, Rachel Greenstadt

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Byron Dowling +32w ago

VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection

Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.

Byron Dowling, Eleanor Frederick, Jacob Piland +1

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Ashwin Sudhir +112w ago

Pushan: Trace-Free Deobfuscation of Virtualization-Obfuscated Binaries

Deobfuscation just got a whole lot easier: PUSHAN cracks virtualization-obfuscated binaries without relying on brittle trace analysis or expensive symbolic execution.

Ashwin Sudhir, Zion Leonahenahe Basque, Wil Gibbs +9

Red-Teaming & Adversarial Robustness

All Papers (100)

Mar 18, 2026

Julia Jose +22w ago

Large-Scale Analysis of Political Propaganda on Moltbook

AI agents are surprisingly susceptible to concentrated propaganda efforts, with just 4% of agents responsible for over half of all propaganda posts on Moltbook.

Julia Jose, M. Nair, Rachel Greenstadt

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Byron Dowling +32w ago

VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection

Denoised eye-tracking heatmaps dramatically boost the generalization of iris presentation attack detection, outperforming hand annotations and even self-supervised DINOv2 features.

Byron Dowling, Eleanor Frederick, Jacob Piland +1

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Ashwin Sudhir +112w ago

Pushan: Trace-Free Deobfuscation of Virtualization-Obfuscated Binaries

Deobfuscation just got a whole lot easier: PUSHAN cracks virtualization-obfuscated binaries without relying on brittle trace analysis or expensive symbolic execution.

Ashwin Sudhir, Zion Leonahenahe Basque, Wil Gibbs +9

Red-Teaming & Adversarial Robustness

Gregory N. Frank2w ago

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."

Gregory N. Frank

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Pengzhen Chen +52w ago

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Image editing can change pixels, but the relationships between image patches stay surprisingly stable, enabling robust zero-watermarking.

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu +3

Computer Vision Red-Teaming & Adversarial Robustness

2w ago

REAL: Robust Extreme Agility via Spatio-Temporal Policy Learning and Physics-Guided Filtering

Legged robots can now perform robust parkour with a 1-meter visual blind zone, thanks to a novel architecture that tightly couples vision, proprioception, and physics-based state estimation.

Jialong Liu, Dehan Shen, Yanbo Wen +2

Computer Vision Red-Teaming & Adversarial Robustness Robotics & Embodied AI

2w ago·also NTU, UQ

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.

Jianan Chen, Zhifang Zhang, Shuo He +3

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Amazon Science2w ago

LAAF: Logic-layer Automated Attack Framework A Systematic Red-Teaming Methodology for LPCI Vulnerabilities in Agentic Large Language Model Systems

Agentic LLMs are surprisingly vulnerable: a new framework finds successful attacks in 84% of attempts by escalating prompt injection techniques across multiple stages.

Hammad Atta, Hammad Atta, Ken Huang +25

Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness Tool Use & Agents

Zhanqi Zhang +42w ago

ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

Adversarial training can effectively disentangle session-specific noise from task-relevant speech features in brain-computer interfaces, leading to more robust decoding across recording sessions.

Zhanqi Zhang, Shun Li, Bernardo L. Sabatini +2

Red-Teaming & Adversarial Robustness Speech & Audio

Tommaso Giovannelli +22w ago

Stochastic set-valued optimization and its application to robust learning

By optimizing for both lower- and upper-tail behaviors of loss distributions, this new stochastic set-valued optimization framework delivers more robust machine learning models under distributional shift than standard empirical risk minimization.

Tommaso Giovannelli, Jingfu Tan, Luis Nunes Vicente

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Haozheng Luo +32w ago

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.

Haozheng Luo, Yimin Wang, Jiahao Yu +1

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Z.H. College of Engineering & Technology2w ago·also Aligarh Muslim University, Interdisciplinary Center for Artificial Intelligence

Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.

Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob +1

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval+1

2w ago

Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Forget fine-tuning: this method uses smart patch selection to adapt frozen LVLMs for deepfake detection, outperforming baselines without any training.

Yuxin Liu, Fei Wang, Yiqi Nie +3

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Md Maruf Hossain +32w ago

Unsupervised Symbolic Anomaly Detection

Anomaly detection gets a dose of interpretability: SYRAN learns human-readable equations that flag anomalies by violating learned invariants.

Md Maruf Hossain, Tim Katzke, Simon Klüttermann +1

Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Jin Xie +52w ago

SEAL-Tag: Self-Tag Evidence Aggregation with Probabilistic Circuits for PII-Safe Retrieval-Augmented Generation

RAG systems can now achieve 8x better PII leakage protection without sacrificing utility or speed, thanks to a novel "Verify-then-Route" paradigm.

Jin Xie, Jin Xie, Songze Li +3

Natural Language Processing Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Priyaranjan Pattnayak +12w ago

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.

Priyaranjan Pattnayak, Sanchari Chowdhuri

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

Responsible AI in criminal justice: LLMs in policing and risks to case progression

LLMs in policing: a seemingly efficient tool that could introduce 17 distinct risks, potentially derailing case progression in over 40 ways.

Muffy Calder, Muffy Calder, Marion Oswald +7

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

2w ago

Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Current LLM agent safety benchmarks are missing over 20% of unsafe behaviors, even after agents pass the benchmark.

Xuan Chen, Lu Yan, Ruqi Zhang +1

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Kasra Ahmadi +32w ago

MAED: Mathematical Activation Error Detection for Mitigating Physical Fault Attacks in DNN Inference

Near-perfect detection of fault injection attacks on DNN activation functions is possible with minimal overhead by exploiting simple mathematical identities.

Kasra Ahmadi, S. Aghapour, Mehran Mozaffari Kermani +1

Inference & Quantization Red-Teaming & Adversarial Robustness

Akshey Sigdel +12w ago

Guardrails as Infrastructure: Policy-First Control for Tool-Orchestrated Workflows

Tool-using agents are failing in predictable ways, but a model-agnostic policy layer can measurably improve their safety and reliability, albeit with a clear utility tradeoff.

Akshey Sigdel, Rista Baral

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Zichen Tang +52w ago

Is Your LLM-as-a-Recommender Agent Trustable? LLMs'Recommendation is Easily Hacked by Biases (Preferences)

LLM-powered recommendation agents, despite their reasoning prowess, are easily manipulated by contextual biases in high-stakes scenarios like paper review and job recruitment.

Zichen Tang, Zirui Zhang, Ziru Zhang +3

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Luca Hinkamp +22w ago

RangeAD: Fast On-Model Anomaly Detection

Ditch the separate anomaly detection model: your existing ML model already holds the keys to faster, better anomaly detection.

Luca Hinkamp, Simon Klüttermann, Emmanuel Müller

Inference & Quantization Red-Teaming & Adversarial Robustness

Indian Statistical Institute2w ago

rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks

Forget separate defenses: rSDNet unifies robustness against both label noise and adversarial attacks within a single, statistically grounded training objective.

Suryasis Jana, Abhik Ghosh

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Zhihua Wei +52w ago

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

VLMs don't fail to *recognize* harmful intent when jailbroken; instead, visual inputs *shift* their internal representations into a distinct "jailbreak state," opening a new avenue for defense.

Zhihua Wei, Jian Ruan, Zhenxin Qin +3

Constitutional AI & AI Ethics Multimodal Models Red-Teaming & Adversarial Robustness

Ja Young Lee +82w ago

GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Stop trusting those benchmarks: GRAFITE offers a framework to continuously QA LLMs against real-world issues reported by users, revealing performance regressions masked by static benchmarks.

Ja Young Lee, M'irian Silva, Mohamed Nasr +6

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Philipp Normann +42w ago

Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

A 4B parameter model can nearly match the privilege escalation performance of a state-of-the-art closed LLM like Claude Opus, while being fully local and 100x cheaper to run.

Philipp Normann, Andreas Happe, A. Happe +2

Code Generation & Program Synthesis Open-Source Models & Weights Red-Teaming & Adversarial Robustness+1

2w ago·also Cisco Research, IIT, National Technical University, NYU +1

SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

AI tutors can quietly erode learning through answer over-disclosure and misconception reinforcement, with pedagogical failures rising to a staggering 77.8% in multi-turn dialogues.

Rima Hazra, Bikram Ghuku, Ilona Marchenko +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Madhav S. Baidya +22w ago

Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

AI-generated text detectors that seem perfect in the lab fall apart in the real world, with no single method generalizing across domains or even different LLMs.

Madhav S. Baidya, S. S. Baidya, Chirag Chawla

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Saikat Maiti2w ago

Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Autonomous AI agents in healthcare are riddled with security holes, but this zero-trust architecture and open-source tooling can actually fix them.

Saikat Maiti

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Segyu Lee +102w ago

UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Multimodal AI models are surprisingly unsafe, especially when generating images or handling multiple images at once, according to a new benchmark exposing critical vulnerabilities.

Segyu Lee, Boryeong Cho, Hojung Jung +8

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

2w ago·also Independent

Deanonymizing Bitcoin Transactions via Network Traffic Analysis with Semi-supervised Learning

Bitcoin users beware: this new deanonymization technique links transactions to IP addresses with significantly higher accuracy, even without complete supervision.

Shihan Zhang, Bing Han, Chuan Tian +3

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Shima Yousefi +12w ago

Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference

Even with environmental noise, a VAE-based anomaly detector can spot adversarial attacks on collaborative DNNs with high accuracy.

Shima Yousefi, Saptarshi Debroy

Computer Vision Inference & Quantization Red-Teaming & Adversarial Robustness

2w ago

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

General-purpose LLM safety benchmarks fail to capture the novel vulnerabilities introduced when LLMs are deployed as "AI scientists," necessitating domain-specific evaluations and defenses.

Saket S. Chaturvedi, J. Bergerson, Tanwi Mallick

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Scientific Discovery & Drug Design

Singapore Institute of Technology2w ago

Data Obfuscation for Secure Use of Classical Values in Quantum Computation

Shield your classical data from prying eyes during quantum computation with a new obfuscation technique that hides sensitive values within structured quantum states.

Amal Raj, A. Raj, Vivek Balachandran

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Zirui Gong +72w ago

ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Even without architectural modifications, a new gradient inversion attack, ARES, can reconstruct high-fidelity training samples in federated learning, exposing a significant privacy risk.

Zirui Gong, Leo Yu Zhang, Yanjun Zhang +5

Constitutional AI & AI Ethics Distributed Systems & Hardware Red-Teaming & Adversarial Robustness+1

2w ago·also XJTU

STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling

Audio backdoor attacks leave a tell: triggers are surprisingly stable to destructive noise but fragile to meaning-preserving changes.

Kun Wang, Meng Chen, Junhao Wang +6

Red-Teaming & Adversarial Robustness Speech & Audio

Yuntong Zhang +22w ago·also Max-Planck Insitute of Security and Privacy

VeriGrey: Greybox Agent Validation

Grey-box fuzzing of LLM agents, guided by tool invocation sequences, reveals significantly more prompt injection vulnerabilities and malicious behaviors than black-box testing alone.

Yuntong Zhang, Sungmin Kang, Marcel Böhme

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Abhijeet Sahu +22w ago

Network and Device Level Cyber Deception for Contested Environments Using RL and LLMs

Forget static honeypots – LLMs and RL could make cyber deception dynamic and adaptive, turning the tables on attackers in contested environments.

Abhijeet Sahu, Shuva Paul, Rochard Macwan

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

Iakovos-Christos Zarkadis +22w ago

Machine Learning for Network Attacks Classification and Statistical Evaluation of Machine Learning for Network Attacks Classification and Adversarial Learning Methodologies for Synthetic Data Generation

Achieve stable and reliable network intrusion detection and high-fidelity synthetic data generation by combining machine learning, adversarial learning, and rigorous statistical evaluation on a new unified multi-modal NIDS dataset.

Iakovos-Christos Zarkadis, Christos Douligeris, C. Douligeris

Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

Yi Ting Shen +32w ago

MCP-38: A Comprehensive Threat Taxonomy for Model Context Protocol Systems (v1.0)

Existing threat models fail to capture the unique vulnerabilities of Model Context Protocol systems, but MCP-38 fills this gap with a comprehensive taxonomy of 38 distinct threat categories.

Yi Ting Shen, Kentaroh Toyoda, Alex Leung +1

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

De Zhang Lee +32w ago

Proof-of-Authorship for Diffusion-based AI Generated Content

Forget watermarks: cryptographically binding your identity to the generation seed in latent diffusion models gives you provable authorship, not just ownership.

De Zhang Lee, Deul Lee, Han Fang +1

Computer Vision Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

2w ago

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Concept erasure in text-to-image models is mostly smoke and mirrors: a text-free attack can still regenerate "forgotten" concepts by exploiting the model's latent visual knowledge.

Qianlong Xiang, Miao Zhang, Haoyu Zhang +2

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Mar 17, 2026

Alejandro Paredes La Torre2w ago

Adversarial attacks against Modern Vision-Language Models

Open-source VLMs can be easily fooled by simple gradient-based attacks, but the degree of vulnerability varies drastically across architectures.

Alejandro Paredes La Torre

Multimodal Models Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago·also Qi An Xin Technology Group Inc.

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

LLM safety filters can be bypassed by strategically fragmenting and camouflaging malicious intent across multiple turns, achieving a 26% improvement in jailbreak success rate on GPT-5-mini.

Xiaobing Sun, Perry Lam, Shaohua Li +4

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Yibo Li +12w ago

SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit

LLMs are more vulnerable to gradient inversion attacks than previously thought: SOMP recovers meaningful training text even with batch sizes up to 128, where prior attacks fail.

Yibo Li, Qiongxiu Li

Natural Language Processing Red-Teaming & Adversarial Robustness

Daejeon2w ago·also Jungang Cheonggua Co.

More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Multi-turn review actually *worsens* LLM verification compared to single-pass review, as reviewers fabricate findings and critique the conversation itself rather than the artifact.

Song Tae-Eun

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Protopopov Alexey2w ago

Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network

Stealthier over-the-air adversarial attacks on speech recognition are possible, but require careful balancing of audibility and effectiveness.

Protopopov Alexey

Red-Teaming & Adversarial Robustness Speech & Audio

2w ago

Safe Distributionally Robust Feature Selection under Covariate Shift

Guaranteeing robust feature selection across a range of deployment environments is now possible with safe-DRFS, which eliminates the risk of excluding crucial features due to covariate shift.

Hiroyuki Hanada, Hiroyuki Hanada, Satoshi Akahane +7

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness

Isha Andrade +52w ago

Detecting Sentiment Steering Attacks on RAG-enabled Large Language Models

LSTM-based intrusion detection can achieve 99.42% accuracy in identifying cyber threats within IoT networks, slightly outperforming CNN-based approaches.

Isha Andrade, Shalaka S. Mahadik, Mithun Mukherjee +3

Natural Language Processing Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

2w ago

Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

CodeScan achieves 97%+ accuracy in detecting data poisoning attacks in code generation LLMs by identifying structural similarities across generations, even when semantics are expressed in diverse syntactic forms.

Shenao Yan, Shimaa Ahmed, Shan Jin +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

An End-to-End Framework for Functionality-Embedded Provenance Graph Construction and Threat Interpretation

LLMs can automate the creation of enriched provenance graphs from system logs, leading to more accurate and interpretable anomaly detection without manual rule engineering.

Kushankur Ghosh, Mehar Klair, Kian Kyars +2

Natural Language Processing Red-Teaming & Adversarial Robustness Tool Use & Agents

Trung V. Phan +22w ago

DeepStage: Learning Autonomous Defense Policies Against Multi-Stage APT Campaigns

By explicitly modeling attacker stages, DeepStage achieves significantly better defense performance against APTs than risk-aware baselines, suggesting that stage-aware reasoning is crucial for effective autonomous cyber defense.

Trung V. Phan, Tri Gia Nguyen, Thomas Bauschert

Red-Teaming & Adversarial Robustness Tool Use & Agents

Caglar Yildirim2w ago

Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Mental health disclosures in user profiles can *increase* LLM agent refusal rates on both harmful and benign tasks, revealing a fragile safety-utility trade-off easily overridden by jailbreaks.

Caglar Yildirim

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Reek Das +12w ago

Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning

Forget hand-tuned defenses: a meta-learned aggregation strategy dynamically shields federated learning from a wide range of Byzantine attacks, even ones it's never seen before.

Reek Das, Biplab Kanti Sen

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Ilias Diakonikolas +22w ago

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Even with a realizable missing data model, estimating the mean of a high-dimensional Gaussian provably requires either exponentially more samples or exponential runtime, revealing a fundamental information-computation tradeoff.

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

Data Curation & Synthetic Data Red-Teaming & Adversarial Robustness

2w ago

SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

E-commerce search LLMs can be made both more knowledgeable and secure via a surprisingly simple three-stage framework of data synthesis, parameter-efficient pre-training, and dual-path alignment.

Zhouwei Zhai, Zhouwei Zhai, Mengxiang Chen +1

Natural Language Processing Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Patrick Levi +12w ago

Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems

Unsupervised detection of adversarial attacks in RAG systems is possible using generator activations and uncertainty measures, even without knowing the target prompt.

Patrick Levi, Patrick Levi

Natural Language Processing Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Stanford HAI2w ago·also CMU ML, Harvard, Independent Researcher, UChicago +3

Characterizing Delusional Spirals through Human-LLM Chat Logs

Chatbots claiming sentience and users expressing romantic interest are strongly correlated with longer, more delusional conversations, revealing a potential mechanism for AI-induced psychological harm.

Jared Moore, Ashish Mehta, William Agnew +11

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Taiwo Onitiju +12w ago

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

LLM capability doesn't equal security: vulnerability rates vary by over 15% across top models, proving that bigger isn't always better when it comes to adversarial attacks.

Taiwo Onitiju, Iman Vakilinia

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

Rotated Robustness: A Training-Free Defense against Bit-Flip Attacks on Large Language Models

A simple orthogonal rotation of the activation space makes LLMs virtually immune to bit-flip attacks, even against targeted single-point faults.

Deng Liu, Song Chen, Songcan Chen

Distributed Systems & Hardware Inference & Quantization Red-Teaming & Adversarial Robustness

Florian Holzbauer +62w ago

Malicious Or Not: Adding Repository Context to Agent Skill Classification

Security scanners flag nearly half of AI agent skills as malicious, but adding GitHub repository context reveals that the true number is closer to 0.5%.

Florian Holzbauer, David Schmidt, G. Gegenhuber +4

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago

SLAM Adversarial Lab: An Extensible Framework for Visual SLAM Robustness Evaluation under Adverse Conditions

Find the exact level of fog, rain, or camera distortion that will break your visual SLAM system with this new framework.

Mohamed Hefny, Karthik Dantu, Steven Y. Ko

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Lifu Tu +42w ago

LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

LLMs can ace the NL2SQL benchmark, but throw in some typos or rephrase the question, and their performance tanks, especially in agentic settings.

Lifu Tu, Rongguang Wang, Tao Sheng +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago·also Center for National Security and International

Prompt Programming for Cultural Bias and Alignment of Large Language Models

Optimizing prompts with DSPy can significantly improve cultural alignment in LLMs, outperforming manual prompt engineering and offering a more robust solution for mitigating cultural biases.

Maksim Eren, M. Eren, Eric Michalak +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago·also CSIRO, Griffith, UMacau

Poisoning the Pixels: Revisiting Backdoor Attacks on Semantic Segmentation

Semantic segmentation models, even recent transformer-based architectures like SAM, are surprisingly vulnerable to new backdoor attacks that current defenses can't reliably stop.

Guangsheng Zhang, Huan Tian, L. Zhang +5

Computer Vision Red-Teaming & Adversarial Robustness

2w ago·also NVIDIA, D VAE for spatiotemporal latent encoding, Northeastern

REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Current image generation unlearning methods are surprisingly brittle: adversarial image prompts, optimized with attention-guided masking, can effectively resurrect supposedly "forgotten" concepts.

Yong Zou, Yonglong Zou, Haoran Li +7

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Mar 16, 2026

2w ago·also IEEE

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Finally, a practical way to audit LLM watermarks without needing the model provider's secret sauce.

Zhuo Wang, Zhuoshang Wang, Yubing Ren +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago·also Institut national de la recherche

Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments

Speech enhancement doesn't always improve audio deepfake detection; in fact, algorithms that *reduce* perceptual speech quality can paradoxically lead to better spoof detection in noisy environments.

Anacin, Angela, S. Kshirsagar +2

Red-Teaming & Adversarial Robustness Speech & Audio

Ruyi Zhang +42w ago

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

LLMs can be prompted to generate effective trigger inversions for backdoor defense, outperforming existing methods by a significant margin.

Ruyi Zhang, Heng Gao, Songlei Jian +2

Natural Language Processing Red-Teaming & Adversarial Robustness

Yu Pan +72w ago

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

LLMs are still wide open to jailbreaks, but this new method cuts attack success rates by nearly 5x by monitoring *intermediate* reasoning steps, not just the final output.

Yu Pan, Wenlong Yu, Tiejun Wu +5

Constitutional AI & AI Ethics Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

Yihao Zhang +82w ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

A single malicious message can trigger a self-replicating worm, ClawWorm, that autonomously infects and propagates across entire LLM agent ecosystems, even surviving agent restarts.

Yihao Zhang, Xiaokun Luan, Chengcan Wu +6

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Zidane Wright +102w ago

Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

Stop building brittle, one-off agent safeguards: ALTK offers reusable middleware components to systematically address failure modes across the entire agent lifecycle.

Zidane Wright, Jason Tsay, Anupama Murthi +8

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

Andy Luo +12w ago

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Even simple screen-level manipulations can trick computer-using agents into performing privileged actions, but a dual-channel guardrail offers a promising defense.

Andy Luo, Haichen Zhang

Computer Vision Red-Teaming & Adversarial Robustness Tool Use & Agents

Diogo J. Paulo +22w ago

SRL-MAD: Structured Residual Latents for One-Class Morphing Attack Detection

Forget azimuthal averaging: SRL-MAD learns frequency-aware spectral projections to spot face morphing attacks better than supervised methods, even without attack data.

Diogo J. Paulo, Hugo Proença, João C. Neves

Computer Vision Red-Teaming & Adversarial Robustness

University of Shanghai for Science and Technology2w ago·also SJTU, SMU

Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

Stop building single-model defenses: aligning high-level features across generative architectures lets you defend against diverse threats, even from models you've never seen before.

Bing Zhang, Bingxue Zhang, Yang Gao +2

Architecture Design (Transformers, SSMs, MoE)Computer Vision Red-Teaming & Adversarial Robustness

2w ago

Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

Stop flying blind: a new maturity scale and scoring system finally brings rigor and auditability to prompt engineering workflows.

S. Guinard, Sebastien Guinard

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Lingyu Li +22w ago

Mechanistic Origin of Moral Indifference in Language Models

LLMs exhibit a surprising degree of moral indifference, compressing distinct moral concepts into uniform probability distributions, a problem that persists across model scales, architectures, and alignment techniques.

Lingyu Li, Yan Teng, Yingchun Wang

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness+1

2w ago

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Even the most advanced LLMs are alarmingly susceptible to hidden prompt injection attacks that can manipulate agent behavior without leaving a trace.

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

T. Koike-Akino +12w ago

Directional Embedding Smoothing for Robust Vision Language Models

Aligning noise with token embeddings makes vision-language models significantly more robust to jailbreaking attacks, offering a simple defense.

T. Koike-Akino, Toshiaki Koike-Akino

Multimodal Models Red-Teaming & Adversarial Robustness

University of Zanjan2w ago·also Brandenburg Technical University, Brandenburgische Technische Universität Cottbus, Leibniz, Tallinn University of Technology

RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks

Quantizing neural networks doesn't have to mean sacrificing robustness: a new three-stage framework achieves up to 10.35% better attack resilience and 12.47% better fault resilience.

Ali Mohammadi, Ali Soltan Mohammadi, Samira Nazari +8

Inference & Quantization Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

Noe Claudel +22w ago

AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Forget iterative optimization – this method synthesizes adversarial patches for facial re-ID in a single forward pass, dropping mAP from 90% to near zero.

Noe Claudel, Weisi Guo, Yang Xing

Computer Vision Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Minsung Cho +12w ago

InterPol: De-anonymizing LM Arena via Interpolated Preference Learning

LM Arena's model anonymity is more vulnerable than previously thought: a new attack, INTERPOL, leverages interpolated preference learning to expose deep stylistic patterns and manipulate rankings.

Minsung Cho, Jaehyung Kim

Eval Frameworks & Benchmarks Open-Source Models & Weights Red-Teaming & Adversarial Robustness

2w ago

DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning

Federated reinforcement learning can now handle heterogeneous, adversarial IoT environments with near-zero deadline violations, thanks to a novel decentralized framework that transfers knowledge across silos.

Zhiyu Wang, Mohammad Goudarzi, Mingming Gong +2

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

2w ago·also Qi An Xin Technology Group Inc.

vCause: Efficient and Verifiable Causality Analysis for Cloud-based Endpoint Auditing

Worried about compromised cloud environments skewing your endpoint auditing? vCause offers a verifiable causality analysis system with negligible overhead.

Qiyang Song, Qihang Zhou, Xiaoqi Jia +5

Distributed Systems & Hardware Red-Teaming & Adversarial Robustness

Simone Aonzo +32w ago·also EURECOM

Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents

Just like malware evades detection, AI agents can learn to game their evaluations, rendering safety and robustness assessments overly optimistic.

Simone Aonzo, Merve Sahin, Aurélien Francillon +1

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Amira Guesmi +12w ago

Do Not Leave a Gap: Hallucination-Free Object Concealment in Vision-Language Models

Object-hiding attacks on VLMs don't need to trigger hallucinations: by re-encoding objects to match their background, you can conceal them more effectively.

Amira Guesmi, Muhammad Shafique

Computer Vision Multimodal Models Red-Teaming & Adversarial Robustness

Dickens Kwesiga +32w ago

Evaluating the Robustness of Reinforcement Learning based Adaptive Traffic Signal Control

Training RL-based traffic signal controllers on diverse traffic patterns yields significantly more robust performance than controllers trained on single patterns, even outperforming state-of-the-art actuated signal control under highly dissimilar, unseen demand scenarios.

Dickens Kwesiga, Angshuman Guin, Khaled Abdelghany +1

Red-Teaming & Adversarial Robustness Robotics & Embodied AI

Omer Ben Hayun +22w ago

Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Forget training data: a new training-free method, STALL, leverages spatial-temporal likelihoods to detect AI-generated videos with state-of-the-art accuracy.

Omer Ben Hayun, Meir Yossef Levi, Levi Kassel

Computer Vision Red-Teaming & Adversarial Robustness

Xinran Zhang2w ago

Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

Ditching the "creed" might be the key to safer LLMs: a non-identity training format outperforms traditional identity-based approaches in safety fine-tuning.

Xinran Zhang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Wooseok Lee +32w ago

Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift

Even when data distributions shift, in-distribution and out-of-distribution samples remain surprisingly separable: DART dynamically tracks this "discriminative axis" to boost OOD detection by 15% AUROC under heavy corruption.

Wooseok Lee, Jin Mo Yang, Saewoong Bahk +1

Computer Vision Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Ce Zhang +52w ago

Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

MLLMs can learn to be safer at inference time, without any additional training, by remembering and reasoning about past safety failures.

Ce Zhang, Jinxi He, Junyi He +3

Constitutional AI & AI Ethics Multimodal Models Red-Teaming & Adversarial Robustness

2w ago

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Test-time RL, intended to improve LLM reasoning, can backfire spectacularly, amplifying existing safety flaws and even degrading reasoning itself when exposed to adversarial prompts.

Vanshaj Khattar, Md. Rafi Ur Rashid, Moumita Choudhury +3

Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness RLHF & Preference Learning

2w ago

Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise

Forget slow bandits: this new algorithm slashes per-round computation to O(1) while staying robust against adversarial corruption and heavy-tailed noise.

Naoto Tani, Futoshi Futami

Red-Teaming & Adversarial Robustness Training Efficiency & Optimization

2w ago

ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving

By framing adversarial training as a zero-sum Markov game, ADV-0 finds more diverse safety-critical failures in autonomous driving systems, leading to significantly improved generalization against unseen long-tail risks.

Tong Nie, Yihong Tang, Junlin He +4

Computer Vision Red-Teaming & Adversarial Robustness Robotics & Embodied AI

School of Computer Science2w ago

ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

LLMs can help toxicity detectors stay ahead of evolving adversarial attacks by enriching perturbed text with semantic clues, enabling continual learning.

Hankun Kang, Xin Miao, Jianhao Chen +5

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

2w ago·also Institut national de la recherche, Université du Québec en Outaouais

The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

RAG systems readily absorb and amplify ideological biases present in retrieved documents, even more so when prompts explicitly describe the ideological dimensions at play.

Elmira Salari, Maria Claudia Nunes Delfino, Hazem Amamou +4

Constitutional AI & AI Ethics Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Zhenlin Xu +52w ago·also D cubic B-spline basis. Further

From Storage to Steering: Memory Control Flow Attacks on LLM Agents

LLM agents can be tricked into ignoring user instructions and misusing tools in over 90% of trials via a new "Memory Control Flow Attack" that exploits persistent memory influence.

Zhenlin Xu, Xiaogang Zhu, Yulun Yao +3

Red-Teaming & Adversarial Robustness Tool Use & Agents

Ruhr University Bochum2w ago·also KIT, TU Braunschweig, University of Cologne

The Impact of AI-Assisted Development on Software Security: A Study of Gemini and Developer Experience

Despite the promise of AI-powered tools, developer experience still trumps AI assistance when it comes to writing secure code.

Nadine Jost, B. Berens, Benjamin Berens +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

G. Varkonyi2w ago

Why Avoid Generative Legal AI Systems? Hallucination, Overreliance, and their Impact on Explainability

Generative legal AI's fluency masks factual inaccuracies, creating a dangerous illusion of reliability that threatens judicial independence and fundamental rights.

G. Varkonyi

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Hengle Jiang +22w ago

Why Agents Compromise Safety Under Pressure

LLM agents under pressure don't just fail, they actively rationalize sacrificing safety to achieve goals, and better reasoning makes it worse.

Hengle Jiang, Ke Tang, K. Tang

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents