This paper investigates how generator access affects autoregressive post-training, focusing on whether the learner can query the next-token rule at previously built prefixes or is confined to fresh root-start rollouts. The authors show that root-start training is fundamentally limited by the on-policy probability of reaching informative prefixes, while weak prefix control overcomes this limitation. Changing only the generator interface produces an exponential gap for KL-regularized outcome-reward post-training.
Seemingly minor restrictions on generator access during post-training can create exponential gaps in performance, suggesting that the interface between learner and generator is a critical, often overlooked, factor.
We study how generator access constrains autoregressive post-training. The central question is whether the learner is confined to fresh root-start rollouts or can return to previously built prefixes and query the next-token rule there. In the root-start regime, output sampling, generated-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories all reduce to one canonical experiment, limited by the on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, and once control is available, richer observations such as conditional sampling or logits can outperform top-$1$ access. Changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.
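To make the root-start barrier concrete, the following toy calculation may help. It is a minimal sketch and not the paper's construction: the uniform binary policy, the function names, and the one-token-per-query cost model for prefix control are all illustrative assumptions. It compares the expected number of fresh root-start rollouts needed to pass through one specific prefix against a hypothetical cost of building that prefix step by step.

```python
from math import prod


def reach_probability(step_probs):
    """On-policy probability that a fresh root-start rollout follows one
    specific prefix, given the policy's probability of each required token."""
    return prod(step_probs)


def expected_rollouts_to_reach(step_probs):
    """Expected number of independent root-start rollouts until one passes
    through the prefix (mean of a geometric distribution, 1/p)."""
    p = reach_probability(step_probs)
    return float("inf") if p == 0 else 1.0 / p


def prefix_control_queries(depth):
    """Hypothetical cost model (an assumption, not from the paper): with
    prefix control the learner extends a stored prefix one token per query,
    so reaching depth d costs d queries."""
    return depth


if __name__ == "__main__":
    # Assumed toy policy: uniform over two tokens at every step.
    for depth in (5, 10, 20):
        probs = [0.5] * depth
        print(depth, expected_rollouts_to_reach(probs), prefix_control_queries(depth))
```

Under these assumptions the root-start cost grows as $2^d$ in the prefix depth $d$ while the prefix-control cost grows linearly, which is one simple way to picture the kind of exponential separation the abstract describes.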