This paper introduces LVSpec, a training-free loosely speculative decoding framework for Video-LLMs that accelerates inference by relaxing the exact-match constraints of existing speculative decoding methods. LVSpec identifies visual-relevant tokens that require strict matching and employs a position-shift tolerant mechanism to accept semantically equivalent but positionally mismatched tokens. Experiments on Qwen2.5-VL-32B and LLaVA-OneVision-72B show that LVSpec preserves >99.8% of target performance while achieving speedups of 2.70x and 2.94x, respectively.
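The verification rule itself is not reproduced here, but the idea admits a compact sketch. In the hypothetical Python below, `is_visual_relevant` and `semantically_equivalent` are stand-ins for LVSpec's token-identification scheme and equivalence test, which the summary does not specify; treat this as a minimal illustration of strict-for-anchors, loose-for-fillers verification, not the paper's implementation.

```python
from typing import Callable, List

def loose_verify(
    draft_tokens: List[int],
    target_tokens: List[int],
    is_visual_relevant: Callable[[int], bool],            # hypothetical anchor test
    semantically_equivalent: Callable[[int, int], bool],  # hypothetical equivalence test
) -> List[int]:
    """Accept a prefix of the draft, relaxing exact match for filler tokens."""
    accepted: List[int] = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted == verified:
            accepted.append(drafted)       # exact match: always accepted
        elif is_visual_relevant(verified):
            break                          # visual anchor disagrees: strict rejection
        elif semantically_equivalent(drafted, verified):
            accepted.append(verified)      # filler token: loose acceptance
        else:
            break                          # genuine mismatch: stop verification
    return accepted
```

Under this reading, a disagreement on a visual anchor ends acceptance immediately, while equivalent filler tokens keep the accepted prefix growing, which is what lengthens the mean accepted length.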
Video-LLMs can be sped up by nearly 3x without sacrificing performance, simply by loosening the strict matching requirements of speculative decoding and focusing on visual-semantic relevance.
Video Large Language Models (Video-LLMs) excel at video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this with a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules that severely limit their acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely speculative decoding framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amid abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves both high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35%, respectively, over state-of-the-art training-free SD methods for Video-LLMs.
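To make the position-shift tolerance concrete, the toy realignment below accepts a drafted token that reappears a few positions later in the target continuation; the window size `max_shift` and the realignment rule are illustrative assumptions, not details taken from the paper.

```python
from typing import List

def shift_tolerant_accept(
    draft_tokens: List[int],
    target_tokens: List[int],
    max_shift: int = 2,  # assumed tolerance window, not from the paper
) -> int:
    """Count draft tokens accepted when small positional shifts are tolerated."""
    accepted = 0
    i = j = 0  # i walks the draft, j walks the target
    while i < len(draft_tokens) and j < len(target_tokens):
        if draft_tokens[i] == target_tokens[j]:
            accepted += 1
            i += 1
            j += 1
            continue
        # Look a few positions ahead in the target: a hit means the drafted
        # token is right but arrived shifted, so realign instead of rejecting.
        window = target_tokens[j + 1 : j + 1 + max_shift]
        if draft_tokens[i] in window:
            j += window.index(draft_tokens[i]) + 1
            continue
        break  # genuine disagreement: fall back to strict rejection
    return accepted
```

For example, with draft `[5, 9, 7]` and target `[5, 8, 9, 7]`, the mismatched token 9 is found one position later, so all three draft tokens are salvaged rather than rejected at the first mismatch.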