This paper introduces Difficulty-Differentiated Policy Optimization (DDPO), a reinforcement learning algorithm designed to mitigate overthinking in Large Reasoning Models (LRMs) by tailoring output length to task difficulty. DDPO optimizes policies separately for simple and complex tasks, reducing length for simple tasks and expanding exploration for complex ones, guided by theoretical conditions for maximizing expected accuracy. Experiments demonstrate that DDPO reduces average answer length by 12% and improves accuracy by 1.85% compared to GRPO, showcasing a better accuracy-length trade-off.
LRMs can be made more efficient and accurate by strategically adjusting their output length based on task difficulty, leading to a better accuracy-length trade-off.
Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to concentrate as tightly as possible around the optimal length. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.
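The core idea above can be sketched as a difficulty-differentiated reward shaping rule. The snippet below is a minimal illustration, not the paper's actual formulation: it assumes the group pass rate proxies task difficulty, uses the group-average length as the "difficulty-level average" reference, and applies a hypothetical penalty coefficient `alpha` and threshold `easy_threshold` that are not taken from the paper.

```python
from statistics import mean

def ddpo_style_rewards(correct, lengths, easy_threshold=0.7, alpha=0.1):
    """Hypothetical difficulty-differentiated reward shaping (illustrative only).

    correct: 0/1 correctness flags for the G answers sampled for one prompt
    lengths: token lengths of those answers
    """
    acc = mean(correct)   # group pass rate as a proxy for task difficulty
    ref = mean(lengths)   # group-average length as the length reference
    rewards = []
    for c, length in zip(correct, lengths):
        r = float(c)
        if acc >= easy_threshold and c:
            # simple task: penalize correct answers that exceed the
            # average-length reference, pushing the policy toward brevity
            r -= alpha * max(0.0, (length - ref) / ref)
        # complex task (low pass rate): rewards are left unshaped so long
        # exploratory reasoning chains are not penalized
        rewards.append(r)
    return rewards
```

In a GRPO-style pipeline these shaped rewards would then be normalized within the group to form advantages; only the shaping step differs by difficulty.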