Mar 2, 2026 | arXiv:2603.02023

PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking

He Li, Fei Song, Feichen Song, Boyi Zeng, Shixiang Song, Z. Xu, Zhi-Qin John Xu, Ziwei He, Zhouhan Lin

AI Summary

The paper introduces PonderLM-3, a pretraining framework that enables token-wise adaptive computation allocation in language models by learning to selectively apply additional computation during inference under self-supervised objectives. PonderLM-3 uses a differentiable attention mask during pretraining paired with a hard pruning rule at inference to ensure train-inference consistency, allowing tokens to receive more computation only when beneficial. Experiments demonstrate that PonderLM-3 achieves a stronger Pareto frontier, attaining lower pretraining perplexity at equal inference FLOPs compared to existing methods, and comparable downstream performance to fixed-step PonderLM-2 with fewer FLOPs.

Key Contribution

Stop wasting compute: PonderLM-3 learns to spend extra inference FLOPs only on the tokens that actually need them, outperforming fixed-step pondering methods.

Abstract

Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? To address this question, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering, built on top of the PonderLM-2 backbone, that learns to selectively allocate additional computation under purely self-supervised objectives. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains performance comparable to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
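The pairing of a differentiable mask in training with a hard pruning rule at inference can be illustrated with a small sketch. The abstract does not give the mask's exact form, so everything here is an assumption for illustration: each pondering slot carries a gate in (0, 1), the gate enters attention as an additive bias log(gate) during training, and at inference a gate below a hypothetical threshold of 0.5 is pruned outright. The function names are likewise hypothetical and are not taken from the paper.

```python
import math

NEG_INF = float("-inf")

def soft_mask_bias(gates, eps=1e-9):
    """Training-time bias (assumed form): log(gate) per slot, smooth in the gate."""
    return [math.log(max(g, eps)) for g in gates]

def hard_mask_bias(gates, threshold=0.5):
    """Inference-time bias (assumed rule): keep a slot if gate >= threshold, else prune it."""
    return [0.0 if g >= threshold else NEG_INF for g in gates]

def softmax(logits):
    """Numerically stable softmax; a logit of -inf receives weight exactly 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, bias):
    """Attention weights for one query after adding the mask bias to the raw scores."""
    return softmax([s + b for s, b in zip(scores, bias)])

# Slot 0 is the ordinary (non-pondering) state and is always kept (gate 1.0).
# Slots 1 and 2 are hypothetical extra pondering steps with learned gates.
scores = [1.0, 0.5, 2.0]
gates = [1.0, 0.9, 0.02]

soft_weights = attend(scores, soft_mask_bias(gates))  # used during pretraining
hard_weights = attend(scores, hard_mask_bias(gates))  # used at inference

# Number of extra steps that would actually be computed at inference.
steps_kept = sum(1 for g in gates[1:] if g >= 0.5)
```

Under these assumptions, a slot with a near-zero gate already receives almost no attention weight during training, so pruning it at inference changes the output only slightly. That is one way to read the abstract's "train-inference consistency", and the compute for pruned slots can then be skipped entirely.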

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
