NUS, Xiaohongshu | Feb 24, 2026 | arXiv:2602.20670

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo, Kanchan Sarkar, Kun Xu, Yang You

AI Summary

The paper introduces CAMEL, a confidence-gated reflection framework for reward modeling. The model first makes a lightweight single-token preference decision, then invokes reflection only for low-confidence instances, where confidence is measured by the log-probability margin between the verdict tokens. To improve self-correction, the model is trained via reinforcement learning with counterfactual prefix augmentation, which exposes it to diverse initial verdicts. CAMEL achieves state-of-the-art performance on three reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models with only 14B parameters.

Key Contribution

A confidence-based gating mechanism lets a 14B-parameter reward model outperform 70B-parameter models, establishing a better accuracy-efficiency Pareto frontier.

Abstract

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
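The gating step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the verdict token names "A" and "B", the threshold value, and the `reflect` callable are all hypothetical placeholders, and the abstract does not say how the threshold is chosen. The sketch only shows the control flow, which is to take the single-token verdict, measure the log-probability margin between the two verdict tokens, and call the more expensive reflection step only when that margin is small.

```python
import math
from typing import Callable, Dict, Tuple


def verdict_margin(logprobs: Dict[str, float], a: str = "A", b: str = "B") -> float:
    """Absolute log-probability margin between the two verdict tokens."""
    return abs(logprobs[a] - logprobs[b])


def gated_judge(
    logprobs: Dict[str, float],
    reflect: Callable[[str], str],
    threshold: float = 1.0,  # hypothetical value; not taken from the paper
) -> Tuple[str, bool]:
    """Return (final_verdict, reflection_used).

    `logprobs` holds the log-probabilities of the verdict tokens "A" and "B"
    from one forward pass. `reflect` is an assumed stand-in for the
    reflection step: it receives the initial verdict and returns a
    possibly revised one.
    """
    # Fast path: the single-token preference decision.
    initial = "A" if logprobs["A"] >= logprobs["B"] else "B"

    # High margin means high confidence, so the fast verdict is kept.
    if verdict_margin(logprobs) >= threshold:
        return initial, False

    # Low margin means low confidence, so reflection is invoked.
    return reflect(initial), True


# Usage with made-up numbers and a stub reflection step that always picks "B".
confident = {"A": math.log(0.9), "B": math.log(0.1)}
uncertain = {"A": math.log(0.55), "B": math.log(0.45)}
stub_reflect = lambda initial_verdict: "B"

confident_result = gated_judge(confident, stub_reflect)
uncertain_result = gated_judge(uncertain, stub_reflect)
```

Because the margin comes from the same forward pass that produces the initial verdict, the gate adds no extra inference cost for instances that take the fast path, which matches the efficiency argument in the abstract.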

Interpretability & Mechanistic Interp | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
