UTokyo | Jun 4, 2026 | arXiv:2606.06096

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

AI Summary

This paper introduces OrderGrad, a family of gradient estimators for policy-gradient methods that optimize order-statistic objectives rather than expected return. By targeting finite-sample L-statistics, OrderGrad recovers distributional objectives such as Value at Risk (VaR) and trimmed means by changing only the rank weights. The evaluation covers tasks where mean optimization is mismatched to the deployment objective, such as LLM math post-training, with the aim of more robust and more exploratory learning.

Key Contribution

OrderGrad lets policy-gradient methods target distributional properties of returns through the choice of rank weights, supporting risk-averse, robust, and exploratory learning.

Abstract

Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
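To make "weighted averages of sorted rewards" concrete, the following is a minimal sketch of how a finite-sample L-statistic can be evaluated on one batch of K rewards, and how different rank-weight vectors select different objectives. It is not taken from the OrderGrad repository: every function name here is hypothetical, the weight conventions (for example, treating the average of the m lowest rewards as a CVaR-style objective) are illustrative assumptions, and the sketch computes only the objective value, not the paper's gradient estimator or its reward transformation.

```python
import numpy as np


def l_statistic(rewards, weights):
    """Return sum_i weights[i] * r_(i), where r_(1) <= ... <= r_(K) are the sorted rewards."""
    sorted_rewards = np.sort(np.asarray(rewards, dtype=float))
    weights = np.asarray(weights, dtype=float)
    if sorted_rewards.shape != weights.shape:
        raise ValueError("rewards and weights must have the same length")
    return float(np.dot(weights, sorted_rewards))


# Hypothetical helpers: each builds a rank-weight vector over ascending ranks.
def mean_weights(k):
    """Uniform weights recover the ordinary sample mean."""
    return np.full(k, 1.0 / k)


def best_of_k_weights(k):
    """All weight on the largest reward (best-of-K)."""
    w = np.zeros(k)
    w[-1] = 1.0
    return w


def lower_tail_weights(k, m):
    """Average of the m lowest rewards (assumed CVaR-style convention)."""
    w = np.zeros(k)
    w[:m] = 1.0 / m
    return w


def trimmed_mean_weights(k, trim):
    """Drop the `trim` lowest and `trim` highest rewards, average the rest."""
    if 2 * trim >= k:
        raise ValueError("trim removes every sample")
    w = np.zeros(k)
    w[trim:k - trim] = 1.0 / (k - 2 * trim)
    return w


def median_weights(k):
    """Sample median: middle rank, or the mean of the two middle ranks."""
    w = np.zeros(k)
    if k % 2 == 1:
        w[k // 2] = 1.0
    else:
        w[k // 2 - 1] = 0.5
        w[k // 2] = 0.5
    return w


if __name__ == "__main__":
    rewards = [3.0, -10.0, 1.0, 2.0, 4.0]  # one outlier at -10
    k = len(rewards)
    for name, w in [
        ("mean", mean_weights(k)),
        ("best-of-K", best_of_k_weights(k)),
        ("lower tail (m=2)", lower_tail_weights(k, 2)),
        ("trimmed mean (trim=1)", trimmed_mean_weights(k, 1)),
        ("median", median_weights(k)),
    ]:
        print(f"{name}: {l_statistic(rewards, w):.3f}")
```

On this batch the single outlier pulls the mean down to 0, while the trimmed mean and median both sit at 2 (up to floating-point rounding), which illustrates why the choice of rank weights alone changes what is being optimized.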

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
