Cambridge | Mistral | Jun 8, 2026 | arXiv:2606.09380

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

AI Summary

This paper introduces Reasoning Arena, an adaptive training framework for improving the reasoning of large language models when verifiable rewards in reinforcement learning become uninformative, that is, when every sampled trace for a prompt receives the same reward. The framework builds trace tournaments that compare reasoning traces head-to-head, turning these otherwise uninformative reward groups into relative reward signals. In the reported experiments, Reasoning Arena outperforms the RLVR baseline by 7.6% on average on competition mathematics and coding benchmarks, accelerates training by 27% to 41%, and saves nearly 50% of generation compute.

Key Contribution

Reasoning Arena routes reward groups that yield no gradient signal to a judge-based trace tournament, converting them into relative rewards that improve reasoning performance while reducing training cost.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
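The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes GRPO-style standardization for the group-relative advantage, fits Bradley-Terry strengths with the standard minorization-maximization (MM) update, and adds a small virtual win and loss per trace (the `prior` parameter, an assumption of this sketch) so that estimates stay finite on a sparse comparison graph. The judge is not modeled: the `comparisons` list holds hypothetical (winner, loser) outcomes, and the function names are illustrative.

```python
import math
from collections import defaultdict


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group (GRPO-style, assumed here)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def fit_bradley_terry(num_traces, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Uses MM updates. Each trace also gets `prior` virtual wins and losses
    against a reference opponent of strength 1, which pins the scale and
    keeps strengths finite when the comparison graph is incomplete.
    """
    wins = [prior] * num_traces
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * num_traces
    for _ in range(iters):
        # Contribution of the virtual reference opponent (2 * prior games).
        denom = [2 * prior / (s + 1.0) for s in strengths]
        # Contribution of every observed pair; unobserved pairs add nothing.
        for (i, j), n in pair_counts.items():
            shared = n / (strengths[i] + strengths[j])
            denom[i] += shared
            denom[j] += shared
        strengths = [w / d for w, d in zip(wins, denom)]
    return [math.log(s) for s in strengths]


# A group where every trace got the same verifiable reward: no gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical judge outcomes on an incomplete graph: only 3 of 6 pairs compared.
comparisons = [(0, 1), (1, 2), (0, 3)]
scores = fit_bradley_terry(4, comparisons)

# Standardizing the Bradley-Terry scores yields non-zero advantages.
arena_advantages = group_relative_advantages(scores)
```

In this toy group, `flat` is all zeros, while `arena_advantages` separates the four traces even though only three of the six possible pairs were judged. How the paper selects anchors, updates the anchor pool, and scales the resulting rewards is not specified in the abstract, so none of that is modeled here.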

Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
