Tsinghua AI · May 25, 2026 · arXiv:2605.25952

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Yinghao Wu, Zhuoyan Luo, Zhaojian Yu, Xiao-Ping Zhang

AI Summary

The paper introduces VEN-VL, a visual ensemble Mixture-of-Experts (MoE) framework, to improve the performance of efficient multimodal models by addressing information bottlenecks caused by aggressive visual token compression. VEN-VL enriches visual representations by unifying different perspectives and then compacts them using adaptive routers within specialized visual experts. By incorporating explicit visual supervision via reconstruction, VEN-VL preserves crucial information, leading to improved performance on complex visual tasks with few tokens.

Key Contribution

Compressing visual tokens doesn't have to mean sacrificing performance: VEN-VL's ensemble-MoE approach recovers accuracy while maintaining efficiency in multimodal models.

Abstract

Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue, together with their reliance on heuristic pruning strategies with coarse attention alignment, creates a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, which facilitates the preservation of crucial information. Experimental results demonstrate the superiority of our approach on complex visual tasks using only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.
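As a rough illustration of the enrich-then-compact idea described in the abstract, the NumPy sketch below concatenates token sets from several visual encoders and then lets a router pick a small number of tokens for each expert. It is not the authors' implementation: the function names (`enrich`, `compact`), the linear router, the linear experts, and the top-k "expert picks its tokens" selection rule are all assumptions made for illustration, and the reconstruction supervision is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enrich(views):
    """Enrich: pool tokens from several visual encoders (hypothetical 'perspectives').

    Each view has shape (N_i, d); the result has shape (sum(N_i), d).
    """
    return np.concatenate(views, axis=0)


def compact(tokens, router_w, expert_ws, k):
    """Compact: each expert keeps the k tokens the router scores highest for it.

    tokens:    (N, d) enriched visual tokens
    router_w:  (d, E) assumed linear router weights
    expert_ws: list of E matrices of shape (d, d), assumed linear experts
    Returns (E * k, d) condensed tokens.
    """
    gates = softmax(tokens @ router_w, axis=-1)          # (N, E) routing probabilities
    outputs = []
    for e, w in enumerate(expert_ws):
        top = np.argsort(-gates[:, e])[:k]               # indices of the k best tokens for expert e
        outputs.append((tokens[top] @ w) * gates[top, e:e + 1])  # project, then scale by gate
    return np.concatenate(outputs, axis=0)
```

With two views of 16 and 32 tokens, two experts, and `k = 4`, this reduces 48 enriched tokens to 8 condensed ones. A learned model would train the router and experts end to end; here the weights are whatever arrays the caller passes in.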

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
