Mar 3, 2026 · arXiv:2603.02748

iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

AI Summary

The paper introduces iGVLM, a framework that enhances Large Vision-Language Models (LVLMs) with instruction-guided visual modulation. It uses a dual-branch architecture: a frozen representation branch and a dynamic conditioning branch that employs AdaLN. This approach addresses a limitation of the static vision encoders in existing LVLMs, which struggle with task-specific visual reasoning. iGVLM demonstrates improved instruction sensitivity across various language backbones and is evaluated on a newly introduced diagnostic benchmark, MM4, which tests logical consistency under multi-query, multi-instruction settings.

Key Contribution

Instruction-guided visual modulation with iGVLM unlocks more fine-grained reasoning in LVLMs, outperforming static vision encoders by dynamically adapting visual representations to the specific textual task.

Abstract

Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
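
As a rough illustration of the affine modulation the abstract attributes to AdaLN, the sketch below normalizes one visual feature vector and then applies a scale and shift predicted from an instruction embedding. It is not the authors' implementation: the function names, the `1 + scale` parameterization, and the zero-initialized projections (chosen here so that the modulated output starts out identical to the plain normalized feature) are all assumptions made for illustration, and how iGVLM combines its two branches is not specified in the text above.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Dense projection: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaln(x, instruction, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm (hypothetical parameterization).

    The instruction embedding predicts a per-channel scale and shift,
    which are applied to the normalized visual feature `x`.
    """
    scale = linear(w_scale, b_scale, instruction)
    shift = linear(w_shift, b_shift, instruction)
    normed = layer_norm(x)
    # Assumption: `1 + scale` keeps the output equal to the plain
    # normalized feature when the projections are zero-initialized.
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


if __name__ == "__main__":
    visual_feature = [1.0, 2.0, 3.0, 4.0]   # stand-in for a frozen-encoder token
    instruction = [0.5, -1.0]               # stand-in for an instruction embedding

    # Zero-initialized projections: modulation is a no-op at the start.
    zero_w = [[0.0, 0.0] for _ in visual_feature]
    zero_b = [0.0] * len(visual_feature)
    print(adaln(visual_feature, instruction, zero_w, zero_b, zero_w, zero_b))

    # Non-zero shift projection: the same image feature now depends on the instruction.
    shift_w = [[1.0, 0.0] for _ in visual_feature]
    print(adaln(visual_feature, instruction, zero_w, zero_b, shift_w, zero_b))
```

With the projections at zero the conditioned feature matches the unconditioned one, which is one simple way to read the "smooth transition" described in the abstract; once the projections are trained, different instructions yield different features for the same image.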

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
