Apple ML · EPFL · Feb 16, 2026 · arXiv:2602.14868

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe

AI Summary

The paper introduces Goldilocks, a teacher-driven data sampling strategy for reinforcement learning that dynamically selects training examples of appropriate difficulty to improve sample efficiency in sparse reward settings. A teacher model predicts question difficulty based on the student's performance on previously seen samples, guiding the student towards questions that are neither too easy nor too hard. Experiments on the OpenMathReasoning dataset demonstrate that Goldilocks improves the performance of models trained with GRPO under a fixed compute budget.

Abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
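The intuition behind "neither too easy nor too hard" can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not state the teacher's architecture or its exact selection rule, so the code below makes two labeled assumptions: the teacher is replaced by a plain dictionary of predicted success probabilities (`predicted_success`, hypothetical), and questions are ranked by the Bernoulli reward variance p(1 - p) (`goldilocks_score`, hypothetical). The only part taken from standard GRPO is the group-relative advantage, which is zero whenever every rollout in a group receives the same reward.

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in standard GRPO:
    (reward - group mean) / group std. If every rollout in the
    group gets the same reward, all advantages are zero and the
    question contributes no learning signal."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def goldilocks_score(p):
    """ASSUMPTION (not from the paper): score a question by the
    variance p * (1 - p) of a 0/1 reward with success probability p.
    It peaks at p = 0.5 and vanishes at p = 0 or p = 1."""
    return p * (1.0 - p)


def select_batch(predicted_success, batch_size):
    """ASSUMPTION: `predicted_success` maps question id -> the
    teacher's predicted success probability for the student.
    Returns the `batch_size` ids with the highest score."""
    ranked = sorted(
        predicted_success,
        key=lambda q: goldilocks_score(predicted_success[q]),
        reverse=True,
    )
    return ranked[:batch_size]


if __name__ == "__main__":
    # All rollouts correct: no signal under sparse 0/1 rewards.
    print(grpo_advantages([1, 1, 1, 1]))
    # Mixed outcomes: non-zero advantages.
    print(grpo_advantages([1, 0, 0, 1]))
    # Hypothetical teacher predictions for four questions.
    preds = {"q_easy": 0.98, "q_mid": 0.5, "q_hard": 0.02, "q_ok": 0.3}
    print(select_batch(preds, 2))
```

In this toy setting, the questions predicted to be almost always solved or almost never solved rank last, because their rollout groups would most often yield identical rewards. In the paper, the predictions come from a teacher model that is updated from the student's performance on seen samples; the fixed dictionary here only stands in for that component.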

Reasoning & Chain-of-Thought · RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
