CMU ML | Feb 25, 2026 | arXiv:2602.21492

GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang

AI Summary

The paper introduces GradAlign, a data selection method for LLM reinforcement learning. RL performance is sensitive to the quality of training problems, and GradAlign addresses this by prioritizing problems whose policy gradients align with gradients from a small, trusted validation set. The result is an adaptive curriculum suited to non-stationary policy optimization. Experiments across three challenging data regimes (unreliable reward signals, distribution imbalance, and a low-utility training corpus) show that GradAlign outperforms existing baselines, with more stable training and improved final performance.

Key Contribution

Instead of relying on manual curation, GradAlign selects RL training data adaptively by aligning policy gradients with those of a small, trusted validation set, which leads to more stable LLM training and improved final performance.

Abstract

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and a low-utility training corpus. GradAlign consistently outperforms existing baselines, yielding more stable training and improved final performance. These results underscore the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation at https://github.com/StigLidu/GradAlign
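
The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that "alignment" is measured by cosine similarity and that the top-k problems are kept; the abstract specifies neither. The function names `alignment_scores` and `select_top_k` are hypothetical, and the gradients are toy 2-D vectors standing in for flattened policy gradients.

```python
import numpy as np

def alignment_scores(train_grads, val_grad, eps=1e-12):
    """Cosine similarity between each per-problem gradient and the validation gradient.

    Assumption: cosine similarity is used as the alignment measure here purely
    for illustration; the source only says gradients should "align".
    """
    train_grads = np.asarray(train_grads, dtype=float)  # shape (n_problems, dim)
    val_grad = np.asarray(val_grad, dtype=float)        # shape (dim,)
    dots = train_grads @ val_grad
    norms = np.linalg.norm(train_grads, axis=1) * np.linalg.norm(val_grad)
    return dots / (norms + eps)  # eps avoids division by zero for zero gradients

def select_top_k(train_grads, val_grad, k):
    """Return indices of the k most aligned problems, plus all scores."""
    scores = alignment_scores(train_grads, val_grad)
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist(), scores

# Toy example: one validation gradient and four candidate training problems.
val_grad = np.array([1.0, 0.0])
train_grads = np.array([
    [2.0, 0.0],   # same direction as the validation gradient
    [0.0, 3.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # partially aligned
])
selected, scores = select_top_k(train_grads, val_grad, k=2)
print(selected)
```

Because the policy changes during RL training, such scores would presumably need to be recomputed as training proceeds, which is what makes the resulting curriculum adaptive. How often the actual method refreshes them is not stated in the abstract.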

Data Curation & Synthetic Data | RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
