The paper introduces PonderLM-3, a pretraining framework for token-wise adaptive computation in language models that learns, under purely self-supervised objectives, where to apply additional computation at inference. PonderLM-3 uses a differentiable attention mask during pretraining, paired with a matching hard pruning rule at inference to keep training and inference consistent, so tokens receive more computation only when it is beneficial. Experiments show that PonderLM-3 defines a stronger Pareto frontier: it attains lower pretraining perplexity at equal inference FLOPs than existing recursive or adaptive baselines, and downstream performance comparable to fixed-step PonderLM-2 with fewer inference FLOPs.
Stop wasting compute: PonderLM-3 learns to spend extra inference FLOPs only on the tokens that actually need them, matching fixed-step PonderLM-2 on downstream benchmarks with fewer inference FLOPs.
Test-time scaling has shown that allocating additional computation at inference can improve generation quality, which raises a natural follow-up question: where should this computation be spent? To address it, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering, built on the PonderLM-2 backbone, that learns to selectively allocate additional computation under purely self-supervised objectives. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains performance comparable to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation.
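The abstract does not spell out how the differentiable mask and the hard pruning rule are defined, so the following is only a minimal sketch of one way such a pair could be made consistent, not the paper's formulation. It assumes a per-position gate in [0, 1] that is added to the attention scores as log(gate) during training, and a threshold rule that drops low-gate positions at inference. The function names, the `threshold` and `eps` parameters, and the cost model (each extra pondering step costs one base forward pass) are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_attention(scores, gates, eps=1e-9):
    """Training-time view (assumed): add log(gate) to each score.

    A gate near 1 leaves the score almost unchanged; a gate near 0
    pushes it toward a large negative value, smoothly suppressing
    that position while staying differentiable in the gate.
    """
    biased = [s + math.log(g + eps) for s, g in zip(scores, gates)]
    return softmax(biased)


def hard_pruned_attention(scores, gates, threshold=0.5):
    """Inference-time view (assumed): drop positions with gate < threshold."""
    kept = [i for i, g in enumerate(gates) if g >= threshold]
    weights = [0.0] * len(scores)
    if kept:
        for i, w in zip(kept, softmax([scores[i] for i in kept])):
            weights[i] = w
    return weights


def relative_inference_cost(steps_per_token, max_steps):
    """Adaptive cost divided by fixed-step cost.

    Assumes every extra pondering step costs one base forward pass,
    so a token with k extra steps costs (1 + k) units.
    """
    adaptive = sum(1 + k for k in steps_per_token)
    fixed = len(steps_per_token) * (1 + max_steps)
    return adaptive / fixed


if __name__ == "__main__":
    scores = [0.5, 1.0, 2.0]
    gates = [1.0, 1.0, 0.0]  # the third position is gated off
    print(soft_masked_attention(scores, gates))
    print(hard_pruned_attention(scores, gates))
    print(relative_inference_cost([0, 2, 0, 1], max_steps=2))
```

When the gates saturate at 0 or 1, the soft and hard versions produce nearly identical attention weights, which is the sense of train-inference consistency this sketch illustrates. The cost helper shows why a token-wise budget can undercut a fixed-step one: with a maximum of 2 extra steps and per-token usage of 0, 2, 0, and 1 steps, the adaptive cost under the assumed model is 7/12 of the fixed-step cost.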