The paper addresses the visual feature bottleneck in Vision-Language Models (VLMs) caused by connecting only the final vision encoder output to the LLM input. The authors introduce Cross-Layer Injection (CLI), a framework with an Adaptive Multi-Projection (AMP) module that harmonizes features from different vision layers and an Adaptive Gating Fusion (AGF) mechanism that lets the LLM selectively inject visual information conditioned on its decoding context. Experiments integrating CLI into LLaVA-OneVision and LLaVA-1.5 across 18 benchmarks show significant performance gains, demonstrating improved multimodal understanding.
VLMs can now access the full visual hierarchy on-demand, thanks to a new cross-layer injection method that dynamically bridges vision encoders and LLMs.
Conventional Vision-Language Models (VLMs) suffer from a severe visual feature bottleneck: a crude, asymmetric connection links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
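The AMP and AGF components described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: it assumes AMP is one linear projection per vision layer into the LLM embedding space, and AGF is a sigmoid gate computed from the current LLM hidden state that weights each layer's contribution before injection. All dimensions, function names, and the pooling choice are illustrative assumptions.

```python
# Hedged sketch of Cross-Layer Injection (CLI): AMP projects multiple vision
# layers into LLM space; AGF gates them by the current decoding context.
# All shapes and parameterizations are assumptions, not the paper's spec.
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm, n_layers, n_tokens = 64, 128, 3, 5

# AMP (assumed form): one learned projection per vision layer,
# harmonizing heterogeneous layer features into the LLM embedding space.
amp_weights = [rng.normal(scale=0.02, size=(d_vis, d_llm))
               for _ in range(n_layers)]

def amp(layer_feats):
    """Project each vision layer's token features into LLM space."""
    return [feats @ w for feats, w in zip(layer_feats, amp_weights)]

# AGF (assumed form): a gate conditioned on the LLM's hidden state decides
# how much of each vision layer's signal to inject at this decoding step.
gate_w = rng.normal(scale=0.02, size=(d_llm, n_layers))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agf(hidden, projected):
    """Fuse projected layer features, weighted by context-dependent gates."""
    gates = sigmoid(hidden @ gate_w)                 # shape (n_layers,)
    fused = sum(g * p for g, p in zip(gates, projected))  # (n_tokens, d_llm)
    # Pool over visual tokens and inject residually into the hidden state.
    return hidden + fused.mean(axis=0)

# Toy forward pass: three vision layers, one decoding-step hidden state.
layer_feats = [rng.normal(size=(n_tokens, d_vis)) for _ in range(n_layers)]
hidden = rng.normal(size=(d_llm,))
out = agf(hidden, amp(layer_feats))
print(out.shape)  # (128,)
```

The residual injection keeps the mechanism lightweight: when the gates saturate near zero, the hidden state passes through unchanged, so the LLM can ignore visual layers that are irrelevant to the current decoding context.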