This survey paper analyzes the inference bottlenecks in large vision-language models (LVLMs), identifying visual token dominance as a key efficiency barrier due to high-resolution feature extraction, quadratic attention, and memory bandwidth limitations. It presents a taxonomy of efficiency techniques across the encoding, prefilling, and decoding stages, highlighting the interplay between upstream decisions and downstream bottlenecks. The paper concludes by outlining future research directions, including hybrid compression, modality-aware decoding, progressive state management, and stage-disaggregated serving.
Visual token dominance is the hidden culprit behind LVLM inference inefficiency, and this paper dissects the problem to reveal how to navigate the fidelity-efficiency trade-off.
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
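To make the claimed bottlenecks concrete, the sketch below gives a back-of-envelope Python estimate of how a single high-resolution image inflates the context length, the prefill attention cost, and the per-step KV-cache traffic. The patch size, hidden width, layer count, and KV-head configuration are illustrative assumptions roughly matching a 7B-class model, not values taken from the paper.

```python
# Back-of-envelope illustration of visual token dominance. All model
# numbers below (patch size, hidden width, layer/head counts) are
# assumptions in the rough range of a 7B-class LVLM, not figures from
# the survey.

def vision_tokens(image_px: int, patch_px: int = 14) -> int:
    """Tokens a ViT-style encoder emits for a square image."""
    return (image_px // patch_px) ** 2

def attention_flops(seq_len: int, hidden: int = 4096, layers: int = 32) -> float:
    """Rough self-attention FLOPs over all layers: ~2 * L^2 * d per layer
    for QK^T plus the attention-weighted sum over V."""
    return 2.0 * seq_len**2 * hidden * layers

def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """fp16 KV-cache footprint (K and V) that decoding re-reads each step."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem

text_tokens = 100                    # a short user prompt
vis_tokens = vision_tokens(1344)     # high-res input: (1344 // 14)^2 = 9216
total = text_tokens + vis_tokens

# Encoding/prefilling: visual tokens dominate the context and, through the
# quadratic term, the attention cost.
print(f"visual share of context: {vis_tokens / total:.0%}")            # ~99%
print(f"prefill attention vs. 1k-token prompt: "
      f"{attention_flops(total) / attention_flops(1000):.0f}x")        # ~87x

# Decoding: each generated token re-reads the whole KV cache, so the
# "visual memory wall" is a bandwidth limit rather than a compute limit.
print(f"KV cache streamed per decode step: "
      f"{kv_cache_bytes(total) / 1e9:.1f} GB")                         # ~1.2 GB
```

Under these assumptions, one 1344-pixel image contributes about 9,200 tokens (roughly 99% of a short-prompt context), inflates prefill attention cost by nearly two orders of magnitude relative to a 1,000-token prompt, and forces about 1.2 GB of KV-cache reads per decoded token, which is why the survey treats decoding as bandwidth-bound.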