Mar 18, 2026 · arXiv:2603.17942

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Raghavv Goel, Mukul Gagrani, Mingu Lee, Christopher M. Lott, Chris Lott

AI Summary

The paper introduces a training-free multi-token prediction (MTP) method that leverages on-the-fly mask tokens sampled from the LLM's embedding space to enable parallel prediction of future tokens. By constructing a speculative token tree and applying a pruning strategy, the method achieves lossless generation with fewer model calls. Experiments on LLaMA3 and Qwen3 demonstrate that this probing-based MTP outperforms existing training-free baselines, increasing acceptance length by up to 12% and throughput by up to 19%.

Key Contribution

LLMs can predict multiple tokens in parallel without any training, simply by cleverly probing their embedding space with dynamically generated mask tokens.

Abstract

Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8-12% on Qwen3, and achieving throughput gains of up to 15-19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
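The tree construction, pruning, and verification steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the mask-token logits have already been converted into one probability distribution per future position (`step_probs`), that a path is scored by the product of its per-position probabilities, and that pruning keeps the `max_nodes` highest-scoring paths. The function names, the scoring rule, and the `target_next` callback (a stand-in for the base model's greedy next-token choice) are all hypothetical.

```python
import heapq
import math
from typing import Callable, List, Sequence, Set, Tuple

Path = Tuple[int, ...]


def build_pruned_tree(
    step_probs: Sequence[Sequence[float]], k: int, max_nodes: int
) -> List[Path]:
    """Best-first expansion of a speculative token tree.

    step_probs[d] is an assumed probability distribution over the vocabulary
    for the d-th future position (hypothetically derived from mask-token
    logits). Only the top-k tokens per position are expanded, and at most
    max_nodes paths are kept, in order of decreasing joint probability.
    """
    # Min-heap on negated log-probability, so the most likely path pops first.
    frontier: List[Tuple[float, Path]] = [(0.0, ())]
    kept: List[Path] = []
    while frontier and len(kept) < max_nodes:
        neg_log_prob, path = heapq.heappop(frontier)
        if path:  # the empty root path is not a candidate itself
            kept.append(path)
        depth = len(path)
        if depth == len(step_probs):
            continue  # no deeper position to expand
        ranked = heapq.nlargest(
            k, enumerate(step_probs[depth]), key=lambda pair: pair[1]
        )
        for token, prob in ranked:
            if prob > 0.0:
                child = path + (token,)
                heapq.heappush(frontier, (neg_log_prob - math.log(prob), child))
    return kept


def verify_greedy(
    tree: Sequence[Path], target_next: Callable[[Path], int]
) -> List[int]:
    """Accept the longest tree path that matches the base model's greedy output.

    target_next(prefix) stands in for the base model's greedy next token after
    the accepted prefix. A real system would score every tree node in a single
    forward pass; here the callback is queried step by step for clarity.
    The returned list always ends with one token chosen by the base model
    itself, so the output matches plain greedy decoding.
    """
    nodes: Set[Path] = set(tree)
    accepted: Path = ()
    while True:
        accepted += (target_next(accepted),)
        if accepted not in nodes:
            return list(accepted)
```

In this sketch, a path is only kept after its parent has been kept, so the retained set always forms a valid tree. The number of tokens returned by `verify_greedy` corresponds to the acceptance length for that decoding step.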

Inference & Quantization · Natural Language Processing · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
