Search papers, labs, and topics across Lattice.
This paper introduces a theoretical framework, based on particle filtering and Sequential Monte Carlo (SMC), to analyze the accuracy-cost tradeoffs of inference-time sampling methods for large language models. The authors establish non-asymptotic guarantees for SMC under specific criteria, propose algorithmic improvements, and identify a fundamental limit inherent to particle filtering. Empirical results validate the theoretical criteria's influence on sampling error but highlight the need for additional theoretical considerations to fully explain final accuracy.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Inference-time methods that aggregate and prune multiple samples have emerged as a powerful paradigm for steering large language models, yet we lack any principled understanding of their accuracy-cost tradeoffs. In this paper, we introduce a route to rigorously study such approaches using the lens of *particle filtering* algorithms such as Sequential Monte Carlo (SMC). Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1) simple criteria enabling non-asymptotic guarantees for SMC; (2) algorithmic improvements to SMC; and (3) a fundamental limit faced by all particle filtering methods. Empirically, we demonstrate that our theoretical criteria effectively govern the *sampling error* of SMC, though not necessarily its final *accuracy*, suggesting that theoretical perspectives beyond sampling may be necessary.