This paper introduces second-order rollout (generating multiple critiques for a response) to make better use of RL training data for LLMs by jointly training generation and critique capabilities. Vanilla RL uses only first-order rollout and therefore mainly improves generation; adding critique generation addresses this limitation. Experiments across models and datasets show that the proposed approach achieves better performance with the same training data. They also yield insights into critique training, such as the importance of label balance and the noise in outcome-based rewards, which sampling techniques can mitigate.
LLMs trained with a novel "second-order rollout" that generates critiques in addition to responses learn more effectively from the same data, unlocking better reasoning.
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities. However, vanilla RL mainly improves generation capability, because it trains with only first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the training data because it neglects critique capability training. To tackle this problem, we introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach uses training data more effectively than vanilla RL and achieves better performance with the same training data. Additionally, we report several findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, and we hope it provides useful inspiration for further advances in RL training.
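The difference between first-order and second-order rollout can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Response`, `Critique`, `first_order_rollout`, `second_order_rollout`, and `balance_by_label` are hypothetical, as are the exact-match outcome label, the critique reward (1.0 when the critic's verdict agrees with the outcome label), and the downsampling rule used to balance labels. The abstract does not specify any of these details.

```python
import random
from dataclasses import dataclass


@dataclass
class Response:
    question: str
    answer: str
    correct: bool  # outcome-based label (assumed here: exact match with the reference)


@dataclass
class Critique:
    response: Response
    verdict: bool  # critic's judgement: True means "this response is correct"
    reward: float  # assumed reward: 1.0 if the verdict agrees with the outcome label


def first_order_rollout(question, reference, sample_answer, k):
    """Sample k responses for one question and label each one by outcome."""
    responses = []
    for _ in range(k):
        answer = sample_answer(question)
        responses.append(Response(question, answer, answer == reference))
    return responses


def balance_by_label(responses, rng):
    """Downsample so correct and incorrect responses are equally represented.

    Hypothetical balancing rule, used only to illustrate label balance.
    """
    positives = [r for r in responses if r.correct]
    negatives = [r for r in responses if not r.correct]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


def second_order_rollout(response, sample_verdict, m):
    """Sample m critiques for one response and reward agreement with its label."""
    critiques = []
    for _ in range(m):
        verdict = sample_verdict(response.question, response.answer)
        reward = 1.0 if verdict == response.correct else 0.0
        critiques.append(Critique(response, verdict, reward))
    return critiques
```

In this sketch, `sample_answer` and `sample_verdict` stand in for calls to the policy model. A joint training step would pool the rewarded responses from `first_order_rollout` with the rewarded critiques from `second_order_rollout`, applied to a label-balanced subset of those responses, so that each question contributes both generation and critique training signal.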