KAISTFeb 26, 2026arXiv:2602.23116

Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee, Jung-Hyun Lee, Minju Hong, Minju Hong, Kwang-Sung Jun, Kwang-Sung Jun, Chulhee Yun, Chulhee Yun, Se-Young Yun, Se-Young Yun

AI Summary

This paper addresses contextual online Reinforcement Learning from Human Feedback (RLHF) with potentially intransitive preferences modeled by the Generalized Bilinear Preference Model (GBPM). It derives a dual gap bound based on strong convexity and GBPM's skew-symmetry, enabling regret analysis for general strongly convex regularizers, going beyond reverse KL regularization. The paper presents two algorithms, Greedy Sampling and Explore-Then-Commit, achieving polylogarithmic and square-root regret bounds, respectively, with the latter being the first statistically efficient guarantee for online RLHF in high dimensions.

Key Contribution

Statistically efficient online RLHF is now possible in high-dimensional settings, thanks to a novel analysis leveraging strong convexity and skew-symmetry in generalized bilinear preference models.

Abstract

We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer (where $\eta^{-1}$ is the regularization strength), generalizing beyond prior works limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error - a result derived solely from strong convexity and the skew-symmetricity of GBPM.Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(\eta)}$-free regret $\tilde{O}(\eta d^4 (\log T)^2)$. (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{O}(\sqrt{\eta r T})$ by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high-dimensions.

RLHF & Preference Learning Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References150

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Regularized Online RLHF with Generalized Bilinear Preferences

Related Papers