This paper introduces HMPO, a single-stage reinforcement learning framework that shortens chain-of-thought (CoT) reasoning traces in large language models. It combines three components: an adaptive median-based length budget, a cosine-decay token reward, and a multiplicative reward formulation that puts answer correctness first. The method achieves 19%–46% token compression with minimal accuracy loss, at a lower training cost than multi-stage approaches.
HMPO compresses chain-of-thought reasoning in large language models by up to 46% of tokens with negligible accuracy degradation.
Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%–46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
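The abstract names the three reward components but gives no formulas, so the following is only a minimal sketch of how they could fit together, not the authors' implementation. The function names, the `floor` parameter, the `max_len` cap, and the exact shape of the cosine decay are all assumptions made for illustration. The budget is the median length of the correct rollouts in a group, the length term stays at 1.0 up to that budget and then decays along a half cosine, and the final reward multiplies the length term by a 0/1 correctness score so that a short but wrong answer earns nothing.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over correct rollouts; None if none are correct."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget, max_len, floor=0.5):
    """Length term: 1.0 at or below the budget, then a half-cosine decay
    down to `floor` at `max_len`. `floor` and `max_len` are hypothetical
    parameters; the abstract does not specify them."""
    if length <= budget:
        return 1.0
    if length >= max_len or max_len <= budget:
        return floor
    t = (length - budget) / (max_len - budget)  # 0 at budget, 1 at max_len
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * t))


def hybrid_rewards(lengths, correct, max_len):
    """Multiplicative reward: correctness (0 or 1) times the length term.
    An incorrect rollout always scores 0.0, however short it is."""
    budget = median_budget(lengths, correct)
    if budget is None:
        # Assumption: with no correct rollout there is no budget, so no reward.
        return [0.0] * len(lengths)
    return [
        (1.0 if c else 0.0) * cosine_length_reward(n, budget, max_len)
        for n, c in zip(lengths, correct)
    ]


if __name__ == "__main__":
    # Four rollouts for one prompt; the last one is wrong.
    print(hybrid_rewards([100, 200, 300, 400], [True, True, True, False], 500))
```

In this sketch the budget for the example group is 200 tokens (the median of 100, 200, and 300), so the first two rollouts keep the full reward, the 300-token rollout is penalized smoothly, and the incorrect rollout scores zero. An additive combination of correctness and length terms would instead let a very short wrong answer collect a partial reward, which is the kind of reward hacking the multiplicative form is meant to limit.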