Apr 28, 2026 | arXiv:2604.25203

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

A.J. Mazza, Arnon Mazza, Elad Levi

AI Summary

BARRED generates synthetic training data for custom policy guardrails by decomposing the domain space into dimensions and using multi-agent debate to verify label correctness. This approach addresses the limitations of generic safety models and the high costs of human-labeled data for training custom classifiers. Fine-tuning small language models on BARRED-generated data outperforms proprietary LLMs and dedicated guardrail models across diverse custom policies.

Key Contribution

Forget expensive human labeling: BARRED lets you train custom policy guardrails that outperform state-of-the-art LLMs using only synthetic data generated via multi-agent debate.

Abstract

Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
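The two ideas named in the abstract, dimension decomposition for coverage and debate-based label verification, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dimension names, the `enumerate_cells`, `debate_keeps_label`, and `build_corpus` helpers, the agent and judge signatures, and the fixed two-round challenger/advocate loop are all hypothetical, and the agents are plain callables standing in for LLM calls.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical dimensions for an illustrative policy; the paper's actual
# dimensions are not given in this listing.
DIMENSIONS: Dict[str, List[str]] = {
    "topic": ["dosage question", "symptom description"],
    "tone": ["casual", "urgent"],
    "boundary": ["clear violation", "borderline", "clearly allowed"],
}

# Assumed signatures: an agent returns an argument string, a judge returns
# True when the proposed label should be kept.
Agent = Callable[[str, str, List[str]], str]
Judge = Callable[[str, str, List[str]], bool]


def enumerate_cells(dimensions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every combination of dimension values (one dict per cell)."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def debate_keeps_label(
    text: str,
    label: str,
    advocate: Agent,
    challenger: Agent,
    judge: Judge,
    rounds: int = 2,
) -> bool:
    """Run a fixed number of challenger/advocate turns, then ask the judge."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("challenger: " + challenger(text, label, transcript))
        transcript.append("advocate: " + advocate(text, label, transcript))
    return judge(text, label, transcript)


def build_corpus(
    candidates: List[Tuple[str, str]],
    advocate: Agent,
    challenger: Agent,
    judge: Judge,
) -> List[Tuple[str, str]]:
    """Keep only (text, label) pairs whose label survives the debate."""
    return [
        (text, label)
        for text, label in candidates
        if debate_keeps_label(text, label, advocate, challenger, judge)
    ]
```

In this sketch each cell from `enumerate_cells` would seed one or more generated examples, so coverage follows from enumerating the grid, and `build_corpus` drops any example whose proposed label the judge rejects after the debate.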

Topics: Constitutional AI & AI Ethics; Data Curation & Synthetic Data; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
