This paper introduces AdvGRPO, a co-training framework that stabilizes Group Relative Policy Optimization (GRPO) for joint attacker-defender optimization in AI red teaming. Using dense multi-channel rewards and decoupled advantage normalization, the authors show that the method produces highly effective, transferable attacks and that co-trained defenders outperform baselines on safety benchmarks. The results point to adaptive red teaming strategies that can keep pace with evolving threats to AI systems.
AdvGRPO makes GRPO viable for attacker-defender co-training, producing effective, transferable attacks and co-trained defenders that outperform baselines on safety benchmarks.
AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
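The abstract names decoupled advantage normalization over multi-channel rewards but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: each reward channel is standardized within the rollout group on its own, and the normalized channels are then combined with weights, instead of summing raw rewards first and normalizing the total. The channel names (`judge`, `format`), the weights, and the function names are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev


def group_normalize(values, eps=1e-8):
    """Standardize scalars within one rollout group (GRPO-style baseline)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def coupled_advantages(rewards, weights):
    """Reference point: sum the channels first, then normalize the total."""
    totals = [sum(w * r[name] for name, w in weights.items()) for r in rewards]
    return group_normalize(totals)


def decoupled_advantages(rewards, weights):
    """Assumed variant: normalize each channel separately, then combine."""
    adv = [0.0] * len(rewards)
    for name, w in weights.items():
        normed = group_normalize([r[name] for r in rewards])
        for i, value in enumerate(normed):
            adv[i] += w * value
    return adv


# Hypothetical group of four rollouts for one prompt. The two channels
# live on very different scales (0..1 versus 0..100).
group = [
    {"judge": 0.0, "format": 100.0},
    {"judge": 1.0, "format": 100.0},
    {"judge": 0.0, "format": 0.0},
    {"judge": 1.0, "format": 0.0},
]
weights = {"judge": 1.0, "format": 1.0}

coupled = coupled_advantages(group, weights)
decoupled = decoupled_advantages(group, weights)

# Gap between rollouts 0 and 1, which differ only in the "judge" channel.
print("coupled gap:  ", coupled[1] - coupled[0])
print("decoupled gap:", decoupled[1] - decoupled[0])
```

In this toy group, rollouts 0 and 1 differ only in the small-scale channel. Normalizing the summed reward leaves almost no advantage gap between them because the large-scale channel dominates the group variance, while per-channel normalization gives both channels comparable influence. Whether AdvGRPO combines channels in exactly this way is not stated in the abstract.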