Apr 16, 2026 | arXiv:2604.14853

Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

Zhiyuan Zhai, Bingcong Li, Bingnan Xiao, Xin Wang

AI Summary

This paper formulates compute allocation during LLM inference as a constrained optimization problem: maximize expected accuracy subject to an average compute budget. The authors propose a two-stage "Solve-then-Learn" pipeline. First, Lagrangian relaxation yields the optimal compute allocation for each input instance; then, a lightweight classifier learns to predict these allocations from cheap input features. Experiments on MATH and GSM8K with three LLMs show consistent gains over uniform and heuristic allocation baselines, with up to 12.8% relative accuracy improvement on MATH under matched budgets.
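
One way to write down the formulation described above is sketched here. The notation is an assumption made for illustration and is not taken from the paper: x is an input, \pi a policy mapping inputs to actions in a set \mathcal{A}, c(a) the compute cost of action a, B the average budget, and \lambda \ge 0 the dual variable.

```latex
% Constrained problem: maximize expected accuracy under an average compute budget B.
\max_{\pi} \; \mathbb{E}_{x}\big[\mathrm{Acc}(x, \pi(x))\big]
\quad \text{s.t.} \quad \mathbb{E}_{x}\big[c(\pi(x))\big] \le B

% Lagrangian relaxation: for a fixed price \lambda, the problem decouples per instance.
a^{\star}_{\lambda}(x) \in \arg\max_{a \in \mathcal{A}} \; \mathrm{Acc}(x, a) - \lambda \, c(a)
```

Read this way, \lambda acts as a price on compute: raising it pushes every instance toward cheaper actions, which is what makes a one-dimensional search over \lambda sufficient to hit a target budget.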

Key Contribution

Stop wasting compute: a learned policy can allocate LLM inference budgets per input, improving accuracy on MATH by up to 12.8% (relative) over the uniform and heuristic baselines under matched budgets.

Abstract

Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
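
The solve stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small discrete action set with known costs and a table of per-instance accuracy estimates, and it reads the "closed-form oracle action" as the argmax of accuracy minus price times cost. The names `oracle_actions`, `average_cost`, `solve_for_budget`, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def oracle_actions(acc: Sequence[Sequence[float]], costs: Sequence[float],
                   lam: float) -> List[int]:
    """Per instance, pick the action maximizing acc - lam * cost.

    Ties go to the cheaper action, which keeps the induced cost
    non-increasing in lam.
    """
    n_actions = len(costs)
    return [
        max(range(n_actions), key=lambda a: (row[a] - lam * costs[a], -costs[a]))
        for row in acc
    ]


def average_cost(actions: Sequence[int], costs: Sequence[float]) -> float:
    """Mean compute cost of an allocation."""
    return sum(costs[a] for a in actions) / len(actions)


def solve_for_budget(acc: Sequence[Sequence[float]], costs: Sequence[float],
                     budget: float, iters: int = 60) -> Tuple[float, List[int]]:
    """Binary-search the dual variable so the average cost fits the budget."""
    if budget < min(costs):
        raise ValueError("budget is below the cheapest action's cost")
    # If compute is effectively free at this budget, no price is needed.
    if average_cost(oracle_actions(acc, costs, 0.0), costs) <= budget:
        return 0.0, oracle_actions(acc, costs, 0.0)
    lo, hi = 0.0, 1.0
    # Grow hi until the allocation it induces is feasible.
    while average_cost(oracle_actions(acc, costs, hi), costs) > budget:
        hi *= 2.0
    # Invariant: lo is infeasible, hi is feasible.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if average_cost(oracle_actions(acc, costs, mid), costs) > budget:
            lo = mid
        else:
            hi = mid
    return hi, oracle_actions(acc, costs, hi)


# Toy data (assumed): 3 instances, 3 actions costing 1, 4, and 16 samples.
COSTS = [1.0, 4.0, 16.0]
ACC = [
    [0.90, 0.92, 0.93],  # easy input: extra samples barely help
    [0.30, 0.60, 0.80],  # medium input
    [0.05, 0.10, 0.50],  # hard input: only the largest budget helps
]
LAM, ACTIONS = solve_for_budget(ACC, COSTS, budget=7.0)
```

On this toy table the search settles on the allocation `[0, 1, 2]` with an average cost of 7: the easy input gets the cheapest action and the hard input the most expensive one. In the learn stage, labels like `ACTIONS` would serve as classification targets for a model that sees only cheap input features.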

Eval Frameworks & Benchmarks, Inference & Quantization, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
