Notre Dame | Washington State | Feb 18, 2026 | arXiv:2602.16543

Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong

AI Summary

The paper introduces a black-box adversarial attack framework, built on Inverse Constrained Reinforcement Learning (ICRL), to evaluate the robustness of Safe RL policies against adversarial perturbations. The framework learns a constraint model and a surrogate policy from expert demonstrations and environment interactions, enabling gradient-based attacks without access to the victim policy's gradients or the true safety constraints. Experiments on Safe RL benchmarks demonstrate the framework's effectiveness in revealing vulnerabilities under limited access.

Key Contribution

Safe RL policies, designed to avoid unsafe actions, can be effectively attacked using a novel framework that learns safety constraints from demonstrations and then crafts adversarial perturbations, even without access to the target policy's gradients.

Abstract

Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
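The attack step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear surrogate policy, the logistic constraint model, the one-step sign-gradient (FGSM-style) update, and all function and parameter names are assumptions chosen for illustration. The point it shows is that, once a surrogate policy and a constraint model have been learned, the gradient of the predicted constraint cost with respect to the observation can be computed entirely on the attacker's side, without querying the victim policy's gradients.

```python
import math

# Hypothetical stand-ins: the abstract does not specify model classes.
# W is the weight matrix of a linear surrogate policy, v the weights of a
# logistic constraint model over actions. Both are assumed already learned.

def surrogate_action(W, obs):
    """Linear surrogate policy: action = W @ obs."""
    return [sum(w * o for w, o in zip(row, obs)) for row in W]

def constraint_cost(v, action):
    """Assumed constraint model: sigmoid(v . action), a value in (0, 1)."""
    z = sum(vi * ai for vi, ai in zip(v, action))
    return 1.0 / (1.0 + math.exp(-z))

def cost_gradient_wrt_obs(W, v, obs):
    """Analytic gradient of constraint_cost(surrogate_action(obs)) w.r.t. obs."""
    c = constraint_cost(v, surrogate_action(W, obs))
    scale = c * (1.0 - c)  # derivative of the sigmoid at the current point
    # Chain rule: d cost / d obs_j = scale * sum_i v_i * W[i][j]
    return [scale * sum(v[i] * W[i][j] for i in range(len(W)))
            for j in range(len(obs))]

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def fgsm_observation_attack(W, v, obs, epsilon):
    """One sign-gradient step that raises predicted cost within an L-inf ball."""
    grad = cost_gradient_wrt_obs(W, v, obs)
    return [o + epsilon * sign(g) for o, g in zip(obs, grad)]

# Toy values, purely illustrative.
W = [[1.0, -0.5], [0.2, 0.8]]
v = [1.0, -1.0]
obs = [0.1, 0.3]
epsilon = 0.05

adv_obs = fgsm_observation_attack(W, v, obs, epsilon)
clean_cost = constraint_cost(v, surrogate_action(W, obs))
adv_cost = constraint_cost(v, surrogate_action(W, adv_obs))
print("clean cost:", clean_cost)
print("adversarial cost:", adv_cost)
```

In a real black-box setting the perturbed observation would then be fed to the victim policy, and the attack succeeds only to the extent that the surrogate and the learned constraint model transfer to the victim and the true constraints.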

Red-Teaming & Adversarial Robustness | RLHF & Preference Learning | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
