This paper introduces Conformal Feedback Alignment (CFA), a framework that leverages conformal prediction to quantify the reliability of individual answers used in preference-based alignment. CFA constructs conformal prediction sets for each answer, aggregates these into reliability scores, and uses these scores as weights in DPO and PPO training. Experiments demonstrate that CFA improves alignment robustness and data efficiency by explicitly modeling answer-side uncertainty, complementing existing preference-level weighting schemes.
Forget weighting preferences alone – this new method uses conformal prediction to directly quantify and leverage the reliability of the *answers* themselves, leading to more robust and data-efficient LLM alignment.
Preference-based alignment methods such as Reinforcement Learning from Human Feedback (RLHF) learn from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences but ignore a more fundamental factor: the reliability of the *answers* being compared. To address this, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage, then aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across multiple datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling *answer-side* uncertainty complements preference-level weighting. Code is provided here.
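The paper does not spell out its nonconformity score or aggregation rule here, so the following is only a minimal NumPy sketch of the general recipe the abstract describes: score each answer with a split-conformal p-value (the standard reliability notion behind conformal prediction sets), combine the two answers' reliabilities into a per-pair weight, and use those weights in a DPO-style loss. The min-aggregation and the specific function names are illustrative assumptions, not CFA's actual design.

```python
import numpy as np

def conformal_p_value(score, cal_scores):
    """Split-conformal p-value of a nonconformity score: the fraction of
    calibration scores at least as extreme, with the +1 finite-sample
    correction. High nonconformity -> small p-value -> low reliability."""
    n = len(cal_scores)
    return (1.0 + np.sum(cal_scores >= score)) / (n + 1.0)

def pair_weight(rel_chosen, rel_rejected):
    # Assumed aggregation: a pair is only as trustworthy as its
    # least reliable answer. CFA's actual rule may differ.
    return min(rel_chosen, rel_rejected)

def weighted_dpo_loss(dlogp_policy, dlogp_ref, weights, beta=0.1):
    """DPO loss with per-pair reliability weights.
    dlogp_* = log p(chosen) - log p(rejected) under policy / reference."""
    logits = beta * (dlogp_policy - dlogp_ref)
    losses = -np.log(1.0 / (1.0 + np.exp(-logits)))  # -log sigmoid(logits)
    return float(np.average(losses, weights=weights))

# Illustrative usage on synthetic calibration scores.
rng = np.random.default_rng(0)
cal = rng.normal(size=200)                 # calibration nonconformity scores
rel_good = conformal_p_value(-2.0, cal)    # low nonconformity -> high reliability
rel_bad = conformal_p_value(3.0, cal)      # high nonconformity -> low reliability
w = pair_weight(rel_good, rel_bad)         # down-weights pairs with an unreliable answer
```

The key design point the abstract emphasizes survives even in this toy form: the weight attaches to the *answers* (via their conformal reliabilities), not to the preference label itself, so a pair containing an unreliable answer contributes less to the DPO gradient regardless of how confident the annotator was.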