College of Software Engineering, Li Auto, School of Computer Science and Engineering, School of Computing and Information, School of Cyber Science and Engineering, SEU, University, ZJU

Jun 1, 2026

arXiv:2606.01934

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

AI Summary

This paper introduces HMPO, a single-stage reinforcement learning framework that shortens chain-of-thought (CoT) reasoning in large language models to reduce inference cost. It combines three components: an adaptive median-based length budget derived from successful rollouts, a cosine-decay token reward, and a multiplicative reward formulation that prioritizes answer correctness. The method achieves 19%–46% token compression with negligible accuracy degradation, at lower training cost than multi-stage baselines.

Key Contribution

HMPO achieves up to 46% token compression with negligible accuracy degradation, using a single training stage instead of a multi-stage pipeline.

Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%–46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
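The abstract names three components but does not give their formulas, so the following is only a minimal sketch of how such a reward could be assembled, not the authors' implementation. It assumes a binary correctness signal, a budget equal to the median token length of the correct rollouts in a group, a half-cosine decay from 1 at the budget to 0 at a hypothetical `max_length` cap, and a final reward that is the product of correctness and the length term. All function and parameter names are illustrative.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over the correct rollouts (assumed budget rule).

    Returns None when no rollout in the group is correct.
    """
    successful = [n for n, ok in zip(lengths, correct) if ok]
    return median(successful) if successful else None


def cosine_length_reward(length, budget, max_length):
    """Assumed shape: 1.0 up to the budget, half-cosine decay to 0.0 at max_length."""
    if length <= budget:
        return 1.0
    if length >= max_length:
        return 0.0
    progress = (length - budget) / (max_length - budget)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def hybrid_reward(length, is_correct, budget, max_length):
    """Multiplicative combination: an incorrect answer earns 0 at any length."""
    if not is_correct or budget is None:
        return 0.0
    return 1.0 * cosine_length_reward(length, budget, max_length)


# Example group of four rollouts; the shortest one is wrong.
lengths = [100, 200, 300, 50]
correct = [True, True, True, False]
budget = median_budget(lengths, correct)
rewards = [hybrid_reward(n, ok, budget, max_length=400)
           for n, ok in zip(lengths, correct)]
```

Under these assumptions, the short but incorrect rollout receives zero reward, which illustrates why a multiplicative form resists the trivial hack of emitting very short wrong answers. An additive form would still pay out the length term in that case.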

Reasoning & Chain-of-Thought, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
