This paper introduces LVLM-MIR, a parameter-efficient fine-tuning framework for Large Vision-Language Models (LVLMs) targeting multimodal interleaved reasoning, with Qwen2.5-VL as the backbone and Low-Rank Adaptation (LoRA) for task-specific adaptation. The framework preprocesses multimodal inputs, extracts visual features with a modified Vision Transformer, performs cross-modal fusion via attention, and generates responses autoregressively. Freezing the pre-trained weights and fine-tuning LoRA adapters in the visual and language modules, the method achieves an aggregate score of 0.7857 on the MIRAGE Challenge Track A Dataset, securing second place.
Freezing most weights and only LoRA-tuning a vision-language model achieves near state-of-the-art multimodal interleaved reasoning performance, suggesting that targeted adaptation can rival full fine-tuning.
Multimodal interleaved reasoning, which requires models to understand interleaved image-text sequences spanning multiple images, is a critical challenge in contemporary AI. This paper proposes a parameter-efficient fine-tuning framework for Large Vision-Language Models, with Qwen2.5-VL as the backbone and Low-Rank Adaptation for task-specific adaptation. The framework integrates four stages: multimodal input preprocessing to align with pre-training distributions, visual feature extraction via a modified Vision Transformer, cross-modal fusion via attention mechanisms, and response generation via an autoregressive decoder. By freezing the pre-trained weights and fine-tuning low-rank adapters in both the visual and language modules, it balances preserving general multimodal knowledge against adapting to the target task, achieving strong performance with low computational overhead. On the MIRAGE Challenge Track A Dataset, the framework performs well across subtasks, achieving an aggregate score of 0.7857 and securing second place in the challenge. Ablation studies confirm that jointly LoRA-tuning the visual and language modules yields the best results; remaining weaknesses on fine-grained visual-difference tasks point to future work on capturing subtle features and adaptive cross-modal alignment.
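The LoRA mechanism the abstract relies on can be sketched in a few lines: the pre-trained weight matrix W is frozen, and only a low-rank update (alpha / r) * B @ A is trained. The NumPy sketch below is illustrative only; the dimensions, rank, and scaling are placeholder values, not those used in the paper or in Qwen2.5-VL.

```python
import numpy as np

# LoRA in a nutshell: the frozen pretrained weight W is augmented with a
# trainable low-rank update (alpha / r) * B @ A. Sizes here are illustrative.
d_out, d_in, r, alpha = 512, 512, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection, rank r
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without
    # materializing the full updated matrix.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted model matches the frozen one exactly,
# so training starts from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters are a small fraction of the full weight's count:
# (r*d_in + d_out*r) / (d_out*d_in) = 2r/d here.
print(f"trainable fraction: {(A.size + B.size) / W.size:.3%}")
# -> trainable fraction: 3.125%
```

This fraction is why the paper can adapt both the visual and language modules while keeping computational overhead low: only A and B receive gradients, and W stays fixed.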