HIT · Mar 4, 2026 · arXiv:2603.04597

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin

AI Summary

This paper introduces GOLF, a reinforcement learning framework that uses group-level natural language feedback to guide exploration in sparse-reward regions. GOLF aggregates external critiques and intra-group attempts to produce high-quality refinements, which are then injected into training as off-policy scaffolds. Experiments on verifiable and non-verifiable benchmarks show a 2.2x improvement in sample efficiency over RL methods trained solely on scalar rewards.

Key Contribution

Incorporating group-level natural language feedback into RL training yields a 2.2x gain in sample efficiency over training on scalar rewards alone, with guidance targeted at sparse-reward regions.

Abstract

Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedback signals are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2$\times$ improvement in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
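The loop described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation (see the linked repository for that). Every name here (`Attempt`, `build_refinement_prompt`, `maybe_inject_scaffold`, `group_relative_advantages`) is hypothetical. The abstract does not specify the injection rule or the advantage estimator, so the sketch assumes one of each: a refinement replaces the worst attempt only when the group's mean reward is at or below a threshold, and advantages are normalized within the group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Attempt:
    text: str
    reward: float             # scalar reward from the environment
    critique: str = ""        # external natural language critique, if any
    off_policy: bool = False  # True when the text was injected as a scaffold


def build_refinement_prompt(question: str, attempts: List[Attempt]) -> str:
    """Aggregate group-level feedback: all attempts plus their critiques."""
    parts = [f"Question: {question}"]
    for i, a in enumerate(attempts, 1):
        parts.append(f"Attempt {i}: {a.text}")
        if a.critique:
            parts.append(f"Critique {i}: {a.critique}")
    parts.append("Write an improved answer using the attempts and critiques above.")
    return "\n".join(parts)


def maybe_inject_scaffold(group: List[Attempt], refinement: Attempt,
                          threshold: float = 0.0) -> List[Attempt]:
    """Assumed adaptive rule: inject only in low-reward groups, and only
    when the refinement beats the best on-policy attempt."""
    if mean(a.reward for a in group) > threshold:
        return group
    if refinement.reward <= max(a.reward for a in group):
        return group
    worst = min(range(len(group)), key=lambda i: group[i].reward)
    new_group = list(group)
    new_group[worst] = Attempt(refinement.text, refinement.reward, off_policy=True)
    return new_group


def group_relative_advantages(group: List[Attempt]) -> List[float]:
    """Assumed estimator: rewards standardized within the group."""
    rewards = [a.reward for a in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # Every attempt got the same reward, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    group = [Attempt(f"try {i}", 0.0, critique="sign error in step 2") for i in range(4)]
    print(group_relative_advantages(group))        # all-failed group
    print(build_refinement_prompt("Solve x + 1 = 3", group))
    refined = Attempt("x = 2", 1.0)                # stand-in for a model-generated refinement
    group = maybe_inject_scaffold(group, refined)
    print(group_relative_advantages(group))        # signal restored by the scaffold
```

Under these assumptions, a group in which every attempt fails produces all-zero advantages, which is the sparse-reward case the abstract targets. Injecting one successful refinement gives the group a non-zero reward spread, so the policy update receives a usable signal.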

Natural Language Processing · RLHF & Preference Learning · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
