Jilin | UMD | UNC | Jun 2, 2026 | arXiv:2606.03102

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu

AI Summary

This paper introduces a novel approach to adaptive sampling for large language models by framing it as a Markov decision process (MDP) and training a lightweight reinforcement learning (RL) controller. This method effectively balances answer correctness, latency, and computation cost, addressing the inefficiencies of existing heuristic-based sampling techniques. Experimental results demonstrate that the RL-guided controller outperforms strong baselines, achieving superior trade-offs in sampling efficiency and answer quality.

Key Contribution

A lightweight RL controller that uses only statistics of final answers decides when to stop sampling, achieving better trade-offs among answer correctness, sampling rounds, and total samples than baselines such as ASC and ESC.

Abstract

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
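
The round-by-round stop-or-continue loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `answer_stats`, `threshold_policy`, `adaptive_sample`, and `penalized_reward` are hypothetical names, the fixed `batch_size` and the penalty weights are assumed values, and a simple majority-share threshold stands in for the trained RL controller. The reward shows one plausible Lagrangian-style form, in which correctness is penalized by the number of rounds (a proxy for latency) and the number of samples (a proxy for computation).

```python
from collections import Counter
from typing import Callable, List, Tuple

Stats = Tuple[int, float, float]


def answer_stats(answers: List[str]) -> Stats:
    """Summarize final answers as (num_samples, top_share, margin over runner-up)."""
    ranked = Counter(answers).most_common(2)
    n = len(answers)
    top = ranked[0][1] / n
    second = ranked[1][1] / n if len(ranked) > 1 else 0.0
    return n, top, top - second


def threshold_policy(stats: Stats, min_share: float = 0.7) -> bool:
    """Stand-in for a learned controller (assumption): stop when the majority share is high."""
    _, top, _ = stats
    return top >= min_share


def adaptive_sample(
    sample_fn: Callable[[int], List[str]],
    policy: Callable[[Stats], bool],
    batch_size: int = 4,   # assumed fixed number of samples per round
    max_rounds: int = 5,   # assumed hard cap on rounds
) -> Tuple[str, int, int]:
    """Sample in rounds until the policy says stop; return (majority answer, rounds, samples)."""
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        answers.extend(sample_fn(batch_size))
        rounds += 1
        if policy(answer_stats(answers)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def penalized_reward(
    correct: bool,
    rounds: int,
    samples: int,
    lam_rounds: float = 0.05,   # assumed multiplier on rounds (latency proxy)
    lam_samples: float = 0.01,  # assumed multiplier on samples (compute proxy)
) -> float:
    """Lagrangian-style reward: correctness minus weighted round and sample costs."""
    return float(correct) - lam_rounds * rounds - lam_samples * samples
```

In this sketch, `sample_fn` would wrap calls to the language model and return one final answer per sample; replacing `threshold_policy` with a learned policy over the same statistics keeps the loop unchanged.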

Reasoning & Chain-of-Thought | RLHF & Preference Learning | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
