This paper reframes decoding in language models as a principled optimization problem on the probability simplex, where the objective balances model score with structural preferences. The authors demonstrate that common decoding strategies such as greedy decoding, Top-K, and Top-P can be derived as special cases within this framework. They introduce Best-of-K (BoK), a novel KL-anchored coverage objective for multi-sample pipelines, and show that it improves accuracy on benchmarks such as MATH500 (e.g., +18.6% for Qwen2.5-Math-7B).
Forget heuristic knob-tuning: decoding algorithms are just regularized optimization problems hiding in plain sight.
Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimization layer: at each token, we solve a regularized problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance: for example, +18.6% accuracy for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
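To make the template concrete, here is a minimal sketch (ours, not the paper's code) of the per-token problem max_p ⟨p, z⟩ − Ω(p) over the simplex, where z are the model's logits. Three standard regularizers have well-known closed-form solutions: Ω = 0 gives greedy decoding (an argmax vertex), negative entropy scaled by temperature gives Softmax sampling, and a squared-ℓ2 penalty gives Sparsemax (Martins & Astudillo, 2016).

```python
# A minimal sketch of the unified template: each decoder solves
#   max_p <p, z> - Omega(p)   subject to p on the probability simplex.
import numpy as np

def greedy(z):
    """Omega = 0: the optimum is a simplex vertex, i.e. the argmax token."""
    p = np.zeros_like(z, dtype=float)
    p[np.argmax(z)] = 1.0
    return p

def softmax_decode(z, temperature=1.0):
    """Omega = -T * H(p) (negative entropy): the optimum is softmax(z / T)."""
    s = z / temperature
    e = np.exp(s - s.max())  # shift logits for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Omega = 0.5 * ||p||^2: the optimum is the Euclidean projection of z
    onto the simplex, which is typically sparse (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv       # tokens kept in the support
    rho = k[support][-1]                    # support size
    tau = (cssv[support][-1] - 1.0) / rho   # threshold from the KKT conditions
    return np.maximum(z - tau, 0.0)

logits = np.array([2.0, 1.0, 0.1, -1.0])
print(greedy(logits), softmax_decode(logits), sparsemax(logits))
```

Top-K and Top-P fit the same template as constraints rather than regularizers: restrict the feasible set to distributions supported on the K highest-scoring tokens, or on the smallest nucleus of cumulative mass P, and renormalize.

The abstract does not spell out BoK's objective, but the coverage probability it targets has a standard form: with K i.i.d. draws from a sampling distribution q, the chance that at least one lands in a set G of good alternatives is 1 − (1 − q(G))^K. One plausible reading of "KL-anchored" (our assumption, not the paper's stated formula, with p the model distribution and λ a trade-off weight) is then

$$\max_{q \in \Delta} \; \Bigl[\, 1 - \bigl(1 - q(G)\bigr)^{K} \,\Bigr] \;-\; \lambda \, \mathrm{KL}\bigl(q \,\|\, p\bigr),$$

which pushes mass toward covering G while the KL term keeps q anchored to the model.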