Microsoft Research | Jun 8, 2026 | arXiv:2606.09701

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

AI Summary

This paper introduces AdvGRPO, a co-training framework that stabilizes Group Relative Policy Optimization (GRPO) for joint attacker-defender optimization in AI red teaming. Using dense multi-channel rewards and decoupled advantage normalization, the authors show that their method generates highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks. The results point toward adaptive red teaming strategies that can keep pace with evolving attackers and defenders.

Key Contribution

AdvGRPO makes GRPO viable for attacker-defender co-training, producing effective, transferable attacks and co-trained defenders that outperform baselines on safety benchmarks.

Abstract

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
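The abstract names decoupled advantage normalization over multi-channel rewards but does not give a formula. The sketch below shows one plausible reading, offered only as an illustration: each reward channel is z-scored within its sampling group (as in standard GRPO) before the channels are summed, so that no single channel's scale dominates the advantage. The function names, the channel names `success` and `diversity`, and the uniform weights are all assumptions, not details from the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its sampling group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(channel_rewards, weights=None, eps=1e-8):
    """Normalize each reward channel separately, then combine (assumed scheme).

    channel_rewards: dict mapping a hypothetical channel name to a list of
    per-completion rewards; all lists must have the same length.
    """
    names = list(channel_rewards)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumed uniform weighting
    n = len(channel_rewards[names[0]])
    total = [0.0] * n
    for name in names:
        adv = group_relative_advantages(channel_rewards[name], eps)
        for i, a in enumerate(adv):
            total[i] += weights[name] * a
    return total


# Hypothetical group of four sampled attacks scored on two channels.
example = {
    "success": [1.0, 0.0, 0.0, 1.0],
    "diversity": [0.1, 0.2, 0.3, 0.4],
}
advantages = decoupled_advantages(example)
```

By contrast, summing the raw channels first and normalizing once would let the channel with the largest variance determine most of the advantage signal, which is one reason a per-channel scheme might be preferred.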

Red-Teaming & Adversarial Robustness | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
