The paper introduces MHA2MLA-VLM, a parameter-efficient framework for converting existing Vision-Language Models (VLMs) to Multi-Head Latent Attention (MLA) architectures without pretraining, addressing the KV cache bottleneck during VLM inference. MHA2MLA-VLM employs modality-adaptive partial-RoPE and modality-decoupled low-rank approximation to compress the visual and textual KV spaces. Experiments on three VLMs demonstrate that the framework restores original performance with minimal fine-tuning, significantly reduces KV cache size, and integrates with KV quantization.
Retrofit your VLMs with Multi-Head Latent Attention (MLA) for faster inference and a smaller memory footprint, without costly pretraining, using this parameter-efficient conversion framework.
As vision-language models (VLMs) tackle increasingly complex multimodal tasks, the rapid growth of the Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both unimodal and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
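The core of MLA-style KV compression is caching a low-rank latent per token instead of the full key/value vectors. A minimal sketch of the modality-decoupled idea, assuming truncated SVD as the low-rank factorization and hypothetical per-modality projection matrices (the paper's actual factorization objective minimizes output activation error, which plain SVD does not capture):

```python
import numpy as np

def low_rank_factor(W, rank):
    """Factor W (d_model x d_head) ~= A @ B via truncated SVD,
    with A (d_model x rank) and B (rank x d_head)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # down-projection, singular values absorbed
    B = Vt[:rank, :]            # up-projection applied at attention time
    return A, B

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 4

# Hypothetical key projections; visual and textual KV spaces are
# factorized independently rather than with one shared basis.
W_k_visual = rng.standard_normal((d_model, d_head))
W_k_text = rng.standard_normal((d_model, d_head))

A_vis, B_vis = low_rank_factor(W_k_visual, rank)
A_txt, B_txt = low_rank_factor(W_k_text, rank)

# At inference only the rank-r latent (x @ A) is cached per token,
# shrinking the per-head cache from d_head to rank floats.
x = rng.standard_normal((1, d_model))
latent = x @ A_vis          # cached: shape (1, rank)
k_vis = latent @ B_vis      # key reconstructed on the fly
```

Decoupling the factorization per modality matters because visual and textual tokens occupy different subspaces of the key/value activations, so a single shared low-rank basis wastes capacity on whichever modality dominates the calibration data.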