HKU, Independent, Indiana University, Ohio State, PKU, UMich | Mar 16, 2026 | arXiv:2603.14989

MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han, Zhongwei Wan, Ziheng Zhang, Jingxuan Zhang, Jing Xiong, Ziyuan Liu, Yifan Zhang, Hangrui Cao, Chenyang Zhao, Mi Zhang

AI Summary

The paper introduces MMSpec, a benchmark to evaluate speculative decoding techniques for vision-language models (VLMs), highlighting the limitations of text-centric methods in multimodal contexts. Through MMSpec, the authors identify the increasing importance of vision awareness in speculative decoding, especially at larger batch sizes, and demonstrate that throughput speedup is not always indicative of latency reduction. To address these issues, they propose ViSkip, a novel speculative decoding method that dynamically adapts speculation to vision tokens, achieving state-of-the-art performance on the MMSpec benchmark.

Key Contribution

Text-based speculative decoding falls flat for vision-language models, but ViSkip dynamically adapts to vision tokens for state-of-the-art acceleration.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
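For readers unfamiliar with the draft-then-verify loop that speculative decoding relies on, the following is a minimal sketch of the generic greedy variant. It is not the MMSpec harness and not ViSkip; the function names (`speculative_decode`, `greedy_decode`) and the toy integer "models" are hypothetical and chosen only for illustration. A real system verifies all drafted positions in a single target forward pass, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List, Tuple

Token = int
# Assumption: a "model" here is just a greedy next-token function.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Plain autoregressive decoding with the target model only."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop.

    Returns the generated tokens and the number of verification rounds.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        rounds += 1
        # 1) The cheap draft model proposes draft_len tokens autoregressively.
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(draft_len):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks the proposal: keep the longest matching prefix,
        #    and at the first mismatch emit the target's own token instead.
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)
                break
            out.append(tok)
        else:
            # Every drafted token was accepted, so the target adds one more.
            out.append(target(out))
    generated = out[len(prompt):len(prompt) + max_new_tokens]
    return generated, rounds


# Toy models (hypothetical): the target counts upward modulo 10;
# the draft agrees except that it guesses wrong after seeing a 2.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
    print(tokens, rounds)
```

Under greedy verification the output matches plain target decoding token for token; the saving comes from needing fewer verification rounds than generated tokens. How many rounds are saved depends on how often the draft agrees with the target, which is the quantity the abstract reports as degrading when text-only methods meet multimodal inputs.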

Eval Frameworks & Benchmarks, Inference & Quantization, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
