The paper introduces Direct Advantage Regression (DAR), a novel alignment algorithm for LLMs that leverages online AI reward signals for weighted supervised fine-tuning, offering more fine-grained supervision than binary preference-based OAIF. DAR maintains theoretical consistency with online RLHF while simplifying implementation and improving learning efficiency by eliminating the need for reinforcement learning. Empirical results demonstrate that DAR achieves higher human-AI agreement and outperforms OAIF and online RLHF baselines in GPT-4-Turbo and MT-Bench evaluations.
LLMs learn better from AI *reward* than AI *preference*, leading to higher human-AI agreement and improved performance compared to standard online AI feedback and RLHF.
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference to align large language models (LLMs). However, the straightforward replacement of humans with AI deprives LLMs of more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm using online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results underscore that AI reward is a better form of AI supervision, consistently achieving higher human-AI agreement than AI preference. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
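The abstract describes DAR as advantage-weighted supervised fine-tuning on AI-rewarded samples. The paper's exact objective is not given here, so the following is only a minimal sketch of the general advantage-weighted regression idea it alludes to: exponentiate centered rewards (a batch-mean baseline and the temperature `beta` are illustrative assumptions, not the paper's specification) and use them to weight a negative log-likelihood loss.

```python
import math

def advantage_weights(rewards, beta=1.0):
    """Turn scalar AI rewards into SFT weights via exponentiated advantages.

    Assumptions for illustration: the baseline is the batch mean reward,
    and `beta` is a temperature controlling how sharply high-reward
    responses are upweighted.
    """
    baseline = sum(rewards) / len(rewards)
    return [math.exp((r - baseline) / beta) for r in rewards]

def weighted_sft_loss(log_probs, rewards, beta=1.0):
    """Weighted negative log-likelihood over sampled responses.

    Responses with higher AI reward contribute more to the supervised
    fine-tuning objective; no RL machinery (value network, PPO clipping)
    is involved.
    """
    weights = advantage_weights(rewards, beta)
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total

# Two sampled responses: the higher-reward one (log-prob -2.0) dominates,
# pulling the loss below the unweighted mean NLL of 3.5.
loss = weighted_sft_loss(log_probs=[-2.0, -5.0], rewards=[1.0, 0.0], beta=0.5)
```

With `beta=0.5` the reward gap of 1.0 gives the first response an e^2-fold larger weight than the second, so the loss sits near the high-reward response's NLL rather than the batch average.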