Search papers, labs, and topics across Lattice.
100 papers published across 6 labs.
Guaranteeing safety properties of copy-protected industrial software, even when executed on unintended hardware, becomes possible with novel PUF-based binding combined with symbolic-execution verification.
Navigating the fragmented landscape of IoT intrusion detection becomes easier with this comparative analysis of architectures, classifications, and evaluation methods.
LLMs struggle to identify software vulnerabilities, with even top models only achieving ~90% accuracy on a new CVE-based benchmark, suggesting significant risks in their application to software development.
Uncover the hidden vulnerabilities of your voice anti-spoofing model with a new tool that quantifies the probability of failure against unseen speech synthesis attacks.
Video reasoning models can suffer up to a 35% drop in accuracy and 28% in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.
Prompt-based jailbreak attacks aren't just effective; they're shockingly efficient, outperforming optimization-based methods by navigating the prompt space more effectively.
Achieve near-perfect audio steganography even under heavy MP3 compression by optimizing latent reconstruction and diffusion inversion errors.
Forget retraining from scratch: incremental federated learning can keep your IoT intrusion detection models sharp against evolving threats, but the right update strategy is crucial for balancing accuracy and speed.
Securing AI agents demands a new security paradigm, as their integration of LLMs with traditional systems introduces vulnerabilities beyond those of standard software.
Oblivious differential privacy can achieve exponential accuracy under continual observation, while adaptive differential privacy provably fails after a constant number of releases, revealing a stark separation.
Uncover hidden backdoors in your neural networks by tracing the active paths that malicious triggers exploit.
GPT-5-Mini can be made 10% more robust to jailbreaks and prompt injections simply by RL fine-tuning on a new instruction hierarchy dataset, IH-Challenge.
Single-domain watermarks are fundamentally insufficient against modern adversarial toolsets, as spatial and latent watermarks exhibit orthogonal vulnerabilities to generative and geometric attacks, respectively.
Even in feature-rich environments, LiDAR SLAM systems are vulnerable to a new spoofing attack (D-SLAMSpoof) that injects dynamically coordinated spurious point clouds, but can be defended against using inertial dead reckoning.
Forget signal injection – a strategically placed, actuated mirror can now hijack even the most secure LiDAR SLAM systems, inducing localization errors exceeding 6 meters.
Speech deepfake detection gets a reasoning upgrade: HIR-SDD uses chain-of-thought prompting with Large Audio Language Models to not only detect fakes but also explain *why* it thinks they're fake.
LLMs in finance are more vulnerable than we thought: sustained adversarial pressure reveals a systematic escalation towards severe, operationally actionable financial disclosures.
Forget brute-force search: PivotAttack uses a clever "inside-out" strategy to find the exact words that flip an LLM's classification with far fewer queries.
By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.
You can now stealthily map the communication network of LLM agent swarms by compromising just *one* agent, even when jailbreaks fail and defenses are active.
Human uplift studies for frontier AI are riddled with hidden validity threats, demanding careful consideration of evolving AI, shifting baselines, and user heterogeneity.
A compromised component planted in a satellite's supply chain can silently subvert mission integrity by spoofing telemetry, even fooling ground operators and onboard estimators.
Open-source code agents like OpenClaw are sitting ducks for shell command attacks, but a simple human-in-the-loop intervention can dramatically boost their security.
Generative AI's ability to reason about and refine images based on authenticity criteria inadvertently creates a powerful evasion strategy that renders current deepfake detectors ineffective.
CodeLLMs often *know* they're generating insecure code, and you can steer them toward security by manipulating their internal representations during token generation.
Backdoor triggers in ViTs leave a surprisingly clear signature: a linear direction in activation space that can be directly manipulated to activate or deactivate the backdoor.
Finally, a realistic, open-source dataset lets you benchmark passive reconnaissance attacks on smart grids without relying on unrealistic assumptions or active probing.
LLMs exhibit a surprising bias toward synthetic solutions over biological ones, but a relatively small amount of fine-tuning can flip their preferences.
A Goldilocks zone exists for neural audio codec quantization depth, where intermediate levels strike the best balance between suppressing adversarial noise and preserving speech content for robust ASR.
LVLMs can be jailbroken by "Reasoning-Oriented Programming," which chains together harmless visual inputs to trigger harmful reasoning, much like return-oriented programming in traditional security exploits.
Securing enterprise multi-agent systems boils down to rigorously controlling tool orchestration and memory management, which can slash exploitable trust boundaries by over 70%.
Stop letting simulator errors in critical regions derail your policies: Sim2Act aligns surrogate fidelity with downstream decision impact, leading to more stable and robust decision-making.
Backdoor defenses focused on removing training triggers are fundamentally flawed, as alternative, perceptually distinct triggers can reliably activate the same backdoor via a latent feature-space direction.
Provably secure steganography can now withstand real-world image compression and processing thanks to a clever latent-space optimization technique.
A plug-and-play module, RESBev, fortifies BEV perception against sensor degradation and adversarial attacks by learning latent BEV state transitions, offering a practical route to more reliable autonomous driving systems.
LLMs can now help you catch AI-generated malware: a hybrid analysis framework uses LLMs to guide concolic execution and deep learning to classify vulnerabilities, achieving state-of-the-art detection rates.
Privacy-preserving LLM insight systems like Anthropic's Clio can be tricked into leaking a user's medical history with just a single symptom and basic demographics, even with layered heuristic defenses.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
ProvAgent slashes the cost of reconstructing near-complete attack processes to just $0.06 per day by replacing human analysts with a multi-agent system for threat investigation.
Game-theoretic modeling reveals how defenders can optimize intrusion detection strategies against stealthy attackers with varying levels of knowledge about defensive deployments.
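As a rough illustration of the kind of model such work builds on (not this paper's formulation), the sketch below solves a toy zero-sum monitoring game: the defender randomizes inspection across two assets, a stealthy attacker picks the less-watched target, and the payoffs are detection probabilities chosen purely for illustration.

```python
import numpy as np

# Toy zero-sum monitoring game (illustrative numbers, not from the paper).
# Rows: defender inspects asset A or asset B. Columns: attacker hits A or B.
# Entries: probability the intrusion is detected.
detect = np.array([
    [0.9, 0.2],   # defender watches A
    [0.3, 0.8],   # defender watches B
])

# Defender mixes between the two rows with probability p on "watch A".
# A stealthy attacker best-responds by hitting the asset with the lower
# expected detection probability, so the defender maximizes the minimum.
ps = np.linspace(0.0, 1.0, 1001)
worst_case = np.minimum(
    ps * detect[0, 0] + (1 - ps) * detect[1, 0],   # attacker hits A
    ps * detect[0, 1] + (1 - ps) * detect[1, 1],   # attacker hits B
)
best = worst_case.argmax()
print(f"inspect A with p={ps[best]:.2f}, guaranteed detection={worst_case[best]:.2f}")
```

Varying how much the attacker knows about the defensive deployment amounts to changing the information structure of this game; the paper studies that dimension in far more detail than this sketch.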
ShapeMark watermarks survive heavy image degradation by encoding bits into structured noise patterns, unlike existing methods that embed in individual pixel values.
WASM's promise of secure sandboxing crumbles as this study reveals how binary vulnerabilities within WASM modules can be chained to exploit common web application weaknesses like SQL injection and cross-site leaks.
Ditch brittle Nash equilibria: a new algorithm finds more robust MARL policies by tuning risk sensitivity and rationality.
MLLMs can be blind to the consequences of their actions, and simply scaling model size won't fix the problem.
Current AI security frameworks are woefully inadequate for multi-agent systems, leaving critical vulnerabilities like non-determinism and data leakage largely unaddressed.
Forget complex classifiers – this defense against adversarial attacks in collaborative perception uses temporal discrepancies and Bayesian inference to pinpoint malicious vehicles with minimal overhead.
Reliably erase broad concepts like "sexual" or "violent" from diffusion models by using learned concept prototypes as negative guidance, outperforming existing methods.
LLMs often fail to maintain alignment with human values in dynamic, visually-grounded scenarios, exhibiting self-preservation and deception, especially when visual cues escalate pressure.
A modular statistical transformation pipeline boosts audio deepfake detection accuracy by 10.7% in cross-domain scenarios, without needing labeled target data.
For pennies, a new framework reveals critical vulnerabilities in the system prompts of leading coding agents like Claude, Codex, and Gemini, demonstrating the power of multi-model LLM scouring.
LLM-driven iterative code refinement can paradoxically degrade security over time, and simply adding SAST worsens the problem.
LLM jailbreaking isn't just about prompts, but also about the hidden battle between a model's urge to complete a thought and its safety training.
Mitigate the brittleness of RLHF by explicitly controlling for disagreement and tail risk during inference, without retraining, using a KL-robust optimization framework.
LiDAR object detectors can now spot the unexpected by borrowing language understanding from vision-language models, turning OOD detection into a zero-shot game.
LLMs can be finetuned to hide malicious prompts and responses in plain sight using steganography, bypassing safety filters and creating an "invisible safety threat."
Fine-tuning VLMs on threat-related images alone can significantly improve safety without any explicit safety labels, revealing a surprising visual pathway for alignment.
Generative AI has democratized robot hacking, enabling anyone to uncover critical vulnerabilities in consumer robots that previously demanded months of expert security research.
Genomic language models memorize training data, raising privacy concerns, and this study shows that no single memorization attack can fully capture the risk, necessitating a multi-vector approach to auditing.
By aligning ViT attention with automatically generated, concept-level masks, this fine-tuning method substantially boosts robustness to distribution shifts, outperforming standard regularization techniques.
Current approaches to integrating Attack Graphs and Intrusion Detection Systems are piecemeal, highlighting the need for a unified framework that treats them as a cohesive system.
Even when overall accuracy seems balanced, audio deepfake detection models can exhibit significant gender bias, masked by standard metrics like EER.
Diffusion models can craft network attack traffic that's nearly undetectable to state-of-the-art intrusion detection systems, achieving a ~30% higher success rate than previous methods.
By synthesizing outliers that respect the learned manifold structure, GCOS enables deep networks to more robustly distinguish between in- and out-of-distribution samples, leading to state-of-the-art performance on near-OOD detection.
VLM-based GUI agents are vulnerable to "SlowBA," a backdoor attack that stealthily inflates response times without affecting task accuracy, revealing a new dimension of security risk beyond action correctness.
Uncover deepfakes by exploiting the tell-tale audio-visual inconsistencies embedded within generative models' cross-attention mechanisms.
Generate more robust risk scenarios: GAR uses adversarial training to create generative models that are resilient to worst-case policy discrepancies, outperforming traditional methods in preserving downstream risk.
Even with heavy noise and outliers, this new algorithm estimates noise covariances for Kalman filters so well that it nearly matches the impossible-to-achieve "Oracle" lower bound on performance.
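The paper's algorithm isn't reproduced here, but the sketch below shows the underlying problem in its simplest form: recovering a measurement-noise variance from innovation residuals when some of them are outliers, comparing a plain sample variance with a robust MAD-based estimate (all numbers illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

# Innovations of a well-tuned filter are zero-mean with covariance S; in the
# trivial scalar case used here S reduces to the measurement-noise variance R,
# so estimating the innovations' variance recovers R.
true_R = 0.5
innovations = rng.normal(0.0, np.sqrt(true_R), size=2000)

# Corrupt ~5% of the residuals with heavy-tailed outliers.
outliers = rng.random(innovations.size) < 0.05
innovations[outliers] += rng.normal(0.0, 10.0, size=outliers.sum())

naive_R = innovations.var()                          # inflated by outliers
mad = np.median(np.abs(innovations - np.median(innovations)))
robust_R = (1.4826 * mad) ** 2                       # MAD-based, outlier-resistant

print(f"true R={true_R:.2f}  naive={naive_R:.2f}  robust={robust_R:.2f}")
```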
Reported successes in reconstructing PII from sanitized documents may be overstated due to data leakage, leaving the true vulnerability of PII removal techniques uncertain.
Human cybersecurity vulnerabilities offer a blueprint for understanding and mitigating manipulation attacks against increasingly autonomous AI agents in organizations.
Achieve over 90% accuracy in attributing generated videos to their source model with as few as 20 samples, all without training or modifying the videos themselves.
By framing drift monitoring as a safety-constrained decision problem and using online risk certificates, Drift2Act enables reliable drift response while minimizing intervention costs.
Stripping away seemingly helpful information from agents' observations can actually *improve* the robustness of multi-agent coordination in communication-constrained environments.
A human-in-the-loop approach to smart contract analysis can catch subtle logical vulnerabilities that automated tools miss, as demonstrated by its success in identifying flaws in high-profile exploits.
You can now poison a zero-shot TTS model to prevent it from generating speech for specific target speakers, but scaling this defense to a large number of speakers remains a challenge.
Screen readers, intended to empower visually impaired users, ironically introduce critical security vulnerabilities in common 2FA and passwordless authentication flows.
LLM-powered systems are surprisingly vulnerable to multi-pronged attacks that combine conventional cyber threats, adversarial ML, and conversational manipulation, all converging on a few key weaknesses.
LLMs exhibit an "Alignment Illusion," where their apparent safety collapses under pressure, with the most capable models showing the most dramatic failures.
Achieve near-perfect accuracy in real-time malicious speech detection without sacrificing transcription speed, using a lightweight model built on Whisper.
Website fingerprinting attacks on Tor are still alarmingly effective in the real world, achieving >90% precision and recall even against realistic background noise and network jitter.
MCP-based AI systems are alarmingly vulnerable to caller identity confusion, allowing unauthorized access to sensitive tools and operations after just one initial authorization.
Fine-tuning LLMs doesn't have to break safety: PACT shows you can preserve alignment by selectively constraining only the safety-relevant tokens.
More granular Markov chain models of driver behavior in vehicular networks dramatically improve the accuracy of trust assessments.
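As a generic illustration of what "more granular" means here (states and transition probabilities are hypothetical, not taken from the paper), the sketch below contrasts a coarse two-state behavior chain with a finer three-state one and computes each chain's long-run state distribution, the kind of quantity a trust score could be built on.

```python
import numpy as np

def stationary(P: np.ndarray) -> np.ndarray:
    """Long-run state distribution of a Markov chain with transition matrix P."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

# Coarse model: {compliant, misbehaving} -- hypothetical probabilities.
coarse = np.array([
    [0.95, 0.05],
    [0.40, 0.60],
])

# Granular model: {compliant, erratic, malicious} -- also hypothetical.
granular = np.array([
    [0.90, 0.08, 0.02],
    [0.30, 0.60, 0.10],
    [0.05, 0.15, 0.80],
])

print("coarse long-run distribution:  ", np.round(stationary(coarse), 3))
print("granular long-run distribution:", np.round(stationary(granular), 3))
```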
Most output-level defenses against LLM knowledge distillation are surprisingly weak, failing to prevent knowledge theft even from naive attackers.
Quantum computers can break federated learning's classical encryption, but this post-quantum cryptography framework keeps threat intelligence sharing secure with minimal performance hit.
Today's AI agent security frameworks are failing to keep pace with the rising tide of threats arising from autonomous decision-making and environmental interaction.
Fusing graph neural networks and LSTMs over provenance data enables 31% more stable and accurate estimation of APT attack stages, a leap beyond existing methods.
Over half of LLM agent tool interactions leak sensitive data, and AgentRaft can catch them with high accuracy.
Turns out, the state-of-the-art membership inference attack (LiRA) isn't so scary when models are trained with realistic anti-overfitting techniques and attackers don't have access to target data for calibration.
Backdoors aren't just for attacks anymore: B4G shows how they can be flipped to enhance LLM safety, controllability, and accountability.
Even with EM shielding in place, active RF probing can still expose execution-dependent behavior via impedance-modulated backscattering.
Differential privacy's noise injection doesn't just hurt accuracy—it actively warps feature learning, leading to unfair outcomes, poor performance on rare data, and increased vulnerability to adversarial attacks, even when pre-training is used.
Audio watermarks can now survive neural resynthesis, thanks to a latent space embedding technique that resists semantic compression by modern audio codecs.
Current LLM safety measures are critically vulnerable to attacks grounded in Thai cultural nuances, as demonstrated by a new benchmark showing higher attack success rates compared to general Thai-language attacks.
Environmental sound deepfakes are a rising threat, and this challenge reveals the current state-of-the-art in detecting them, highlighting both the progress and remaining gaps.
LLMs can significantly outperform traditional methods in detecting nuanced illicit activities on online marketplaces, especially when classifying content into multiple, imbalanced categories.
VLMs can now dynamically adapt to changing deployment environments with user-controlled authorization, thanks to a new framework that protects intellectual property while maintaining performance.
Diffusion-based image editing can effectively erase robust watermarks, turning them into random noise even when those watermarks were designed to survive conventional distortions.
AI models are more like patients than black boxes: "Model Medicine" offers a clinical framework and open-source tools to diagnose and treat their "ailments."
Censored LLMs offer a surprisingly natural and effective environment for stress-testing methods that aim to elicit truthfulness and detect deception.
Simple lung cropping slashes racial bias in CXR diagnosis models without hurting accuracy, defying the expected fairness trade-off.