This paper introduces an online RLHF algorithm that combines affirmative nudges, an epistemic neural network for modeling reward uncertainty, and information-directed exploration to improve data efficiency. The algorithm incrementally refines the reward and language models as choice data arrives, using a REINFORCE variant for language model updates. In experiments with Gemma LLMs, the online approach matches offline RLHF trained on 200K labels while using fewer than 20K labels, and extrapolation suggests a 1,000x efficiency gain at larger scale.
Online RLHF can match the performance of offline RLHF with less than one-tenth of the labels, and extrapolation suggests one-thousandth at scale.
We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variant of REINFORCE, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels, a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
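To make the update rule concrete, the following is a minimal sketch of a REINFORCE step with an affirmative nudge, under assumptions that the abstract does not confirm. It uses a toy softmax policy over a small fixed set of responses rather than a language model, treats the nudge as a small positive constant added to the reward-model signal, and uses a Bradley-Terry form for the choice probability that a reward model would be fit to. The names `choice_probability`, `reinforce_step`, `nudge`, and `lr` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probability(reward_a, reward_b):
    """Probability that response A is chosen over response B.

    Assumption: a Bradley-Terry choice model; the abstract only says the
    reward model is fit to choice data.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def reinforce_step(logits, sampled, reward, nudge=0.1, lr=0.5):
    """One REINFORCE update on a softmax policy over a finite response set.

    The reinforcement signal is the reward-model score plus a small positive
    constant (assumed form of the affirmative nudge). The gradient of
    log pi(sampled) with respect to logit j is 1[j == sampled] - pi[j].
    """
    probs = softmax(logits)
    signal = reward + nudge
    return [
        logit + lr * signal * ((1.0 if j == sampled else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]
    updated = reinforce_step(logits, sampled=1, reward=1.0)
    print(softmax(updated))
```

In this toy setting, a sampled response with a reward of zero still gains probability when the nudge is positive, whereas with `nudge=0.0` the same update leaves the policy unchanged. Epistemic uncertainty modeling and information-directed exploration are not represented in this sketch.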