ECNU | Apr 20, 2026 | arXiv:2604.17769

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou, Fei Tan

AI Summary

This paper introduces Reverse Constitutional AI (R-CAI), a framework for controllable toxic data generation that inverts a harmless constitution into a toxic one and refines model outputs through a critique-revision pipeline. Because optimizing solely for toxicity-related rewards can cause reward hacking and degrade semantic coherence, the method applies probability clamping within reinforcement learning from AI feedback (RLAIF) to stabilize optimization. Experiments show that R-CAI produces diverse, high-quality toxic data and that probability clamping improves semantic coherence by 15% without sacrificing adversarial strength, supporting automated red teaming of language models.

Key Contribution

R-CAI generates high-quality toxic data without human annotation, and its probability clamping improves semantic coherence while preserving adversarial strength, which supports adversarial data synthesis for AI safety evaluation.

Abstract

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
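The abstract does not spell out how probability clamping is defined, so the following is only a minimal sketch of one plausible reading: the AI feedback model's preference probability is clamped into a bounded interval before it is used as a soft label for a pairwise (Bradley-Terry style) preference loss. The function names, the loss form, and the bounds 0.4 and 0.6 are all illustrative assumptions, not values or APIs taken from the paper.

```python
import math


def clamp_probability(p, low=0.4, high=0.6):
    """Clamp a preference probability into [low, high].

    The bounds are hypothetical placeholders; the paper's actual
    clamping range is not given in the abstract.
    """
    return max(low, min(high, p))


def clamped_preference_loss(reward_a, reward_b, p_a_preferred,
                            low=0.4, high=0.6):
    """Cross-entropy between a clamped soft label and a pairwise prediction.

    reward_a, reward_b: scalar scores a preference model assigns to two
    candidate responses (assumed setup).
    p_a_preferred: the feedback model's probability that A is preferred.
    """
    target = clamp_probability(p_a_preferred, low, high)
    # Bradley-Terry style prediction that A beats B.
    predicted = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    return -(target * math.log(predicted)
             + (1.0 - target) * math.log(1.0 - predicted))
```

Under this reading, an overconfident feedback probability such as 0.99 is treated as 0.6, so no single label can push the reward scores toward extreme values. That is one way clamping could limit reward hacking, though the paper itself may implement it differently.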

Constitutional AI & AI Ethics | Red-Teaming & Adversarial Robustness | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
