The paper introduces a black-box adversarial attack framework, Inverse Constrained Reinforcement Learning (ICRL), to evaluate the robustness of Safe RL policies against adversarial perturbations. ICRL learns a constraint model and surrogate policy from expert demonstrations and environment interactions, enabling gradient-based attacks without needing the victim policy's gradients or true safety constraints. Experiments on Safe RL benchmarks demonstrate the framework's effectiveness in revealing vulnerabilities under limited access.
Safe RL policies, designed to avoid unsafe actions, can be effectively attacked using a novel framework that learns safety constraints from demonstrations and then crafts adversarial perturbations, even without access to the target policy's gradients.
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
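To make the surrogate-based attack idea concrete, here is a minimal sketch of crafting a bounded perturbation from a learned surrogate policy and a learned constraint model, without ever querying the victim's gradients. Everything specific here is an assumption for illustration and is not taken from the paper: the perturbation is applied to the observation, the surrogate is a single tanh layer, the constraint model is a linear cost score, and the update is a one-step sign-gradient (FGSM-style) rule under an L-infinity budget `eps`. The names `surrogate_action`, `cost_logit`, and `fgsm_observation_attack` are hypothetical.

```python
import numpy as np


def surrogate_action(W, b, obs):
    """Hypothetical surrogate (learner) policy: a single tanh layer."""
    return np.tanh(W @ obs + b)


def cost_logit(u, v, state, action):
    """Hypothetical learned constraint model: a linear cost score.

    A higher score means the (state, action) pair is predicted to be less safe.
    """
    return float(u @ state + v @ action)


def fgsm_observation_attack(W, b, u, v, state, eps):
    """One-step sign-gradient perturbation of the observation.

    The true state is unchanged; only the observation fed to the policy is
    perturbed. The gradient is taken through the surrogate policy and the
    learned cost model, so no victim gradients are required.
    """
    a = surrogate_action(W, b, state)
    # Chain rule: d(cost)/d(obs) = W^T [(1 - a^2) * v], since tanh'(z) = 1 - tanh(z)^2.
    grad = W.T @ ((1.0 - a ** 2) * v)
    # Step in the direction that increases predicted cost, within the eps budget.
    return state + eps * np.sign(grad)


# Toy usage with hand-picked parameters (assumed, not learned).
W = np.eye(2)
b = np.zeros(2)
u = np.zeros(2)
v = np.ones(2)
state = np.zeros(2)
eps = 0.1

adv_obs = fgsm_observation_attack(W, b, u, v, state, eps)
clean_cost = cost_logit(u, v, state, surrogate_action(W, b, state))
adv_cost = cost_logit(u, v, state, surrogate_action(W, b, adv_obs))
print("perturbation:", adv_obs - state)
print("predicted cost, clean vs. perturbed:", clean_cost, adv_cost)
```

In a black-box setting, the perturbed observation computed on the surrogate would then be fed to the victim policy, relying on the attack transferring from the surrogate to the victim. A one-step rule is used here only for brevity; an iterative projected variant would follow the same pattern.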