This paper addresses the engineering challenges of scaling speculative decoding for Llama models in production environments, focusing on efficient GPU implementation of tree attention and multi-round speculative decoding. The authors present training and inference optimization techniques based on EAGLE to achieve state-of-the-art inference latency. Their optimized Llama4 Maverick decodes at approximately 4 ms per token (batch size 1) on 8 NVIDIA H100 GPUs, a 10% improvement over prior methods, and achieves 1.4x-2.0x speedup for large batch sizes.
Speculative decoding for Llama just got 10% faster, thanks to production-scale optimizations that unlock new levels of inference efficiency.
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing its distinctive operations (e.g., tree attention and multi-round speculative decoding) on GPUs. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previous best-known method. Furthermore, for EAGLE-based speculative decoding, our optimizations yield a 1.4x to 2.0x speed-up at large batch sizes at production scale.
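To make the draft-then-verify idea behind speculative decoding concrete, the sketch below implements a single speculative round in plain Python. The two "models" are hypothetical toy distributions standing in for an EAGLE draft head and the full Llama target; the accept/reject rule (accept a drafted token t with probability min(1, p(t)/q(t)), resampling from the residual on the first rejection) is the standard speculative sampling scheme, not code from the paper, and it shows a single linear draft chain rather than the tree attention the authors optimize.

```python
import random

random.seed(0)

VOCAB_SIZE = 8

def draft_model(ctx):
    """Hypothetical cheap draft distribution (stand-in for an EAGLE head)."""
    probs = [0.05] * VOCAB_SIZE
    probs[(ctx[-1] + 1) % VOCAB_SIZE] += 0.60  # biased toward one token
    return probs

def target_model(ctx):
    """Hypothetical expensive target distribution (stand-in for Llama)."""
    probs = [0.04] * VOCAB_SIZE
    probs[(ctx[-1] + 1) % VOCAB_SIZE] += 0.68  # similar but not identical
    return probs

def speculative_step(ctx, k=4):
    """One draft-then-verify round of speculative decoding.

    A token t drafted with probability q(t) is accepted with probability
    min(1, p(t)/q(t)) under the target distribution p; on the first
    rejection we resample from the normalized residual max(0, p - q).
    """
    # Phase 1: draft k tokens autoregressively with the cheap model.
    c = list(ctx)
    drafted, qs = [], []
    for _ in range(k):
        q = draft_model(c)
        t = random.choices(range(VOCAB_SIZE), weights=q)[0]
        drafted.append(t)
        qs.append(q)
        c.append(t)

    # Phase 2: verify with the target model (in a real system, one batched
    # forward pass scores all drafted positions at once).
    accepted = list(ctx)
    for t, q in zip(drafted, qs):
        p = target_model(accepted)
        if random.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)  # draft verified, keep it
        else:
            # First rejection: emit a correction token from the residual.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            residual = [r / total for r in residual]
            accepted.append(random.choices(range(VOCAB_SIZE),
                                           weights=residual)[0])
            break
    else:
        # All k drafts accepted: append one bonus token from the target.
        p = target_model(accepted)
        accepted.append(random.choices(range(VOCAB_SIZE), weights=p)[0])
    return accepted

out = speculative_step([0], k=4)
print(out)  # each round emits between 1 and k+1 new tokens
```

Because each round can emit up to k+1 tokens for a single target-model verification pass, the decode latency per token drops whenever the draft model's acceptance rate is high, which is the effect the paper's EAGLE-based optimizations exploit.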