NJU, Shanghai AI Lab, Shanghai Innovation, SJTU

Jun 10, 2026

arXiv:2606.12195

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Yu Qiao, Yi Wang

AI Summary

This paper introduces InternVideo3, a framework for long-horizon multimodal tasks built on Multimodal Contextual Reasoning (MCR), which treats understanding as a closed-loop process over a shared, evolving context. The authors note that existing open-source efforts focus largely on text-dominant settings, and they report strong performance on video benchmarks such as Video-MME, MLVU, and EgoSchema. They conclude that efficient context handling and closed-loop reasoning are vital for agentic behavior in multimodal settings, particularly in video tasks requiring sustained temporal understanding.

Key Contribution

Efficient context handling, paired with closed-loop reasoning, helps adapt open multimodal models toward long-horizon, visually grounded agency in video tasks.

Abstract

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
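The closed-loop view described in the abstract can be illustrated with a toy sketch. This is not the InternVideo3 implementation: the `Context` class, the `retrieve_clip` tool, the caption-based "video", and the fixed `queries` list (standing in for the model's own reasoning) are all hypothetical names and simplifications introduced here. The sketch only shows the shape of the loop: reason, call a tool, add observations to a shared context, and verify whether enough evidence has accumulated.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Shared, evolving context (hypothetical): instruction plus typed entries."""
    instruction: str
    entries: list = field(default_factory=list)  # (kind, payload) tuples

    def add(self, kind, payload):
        self.entries.append((kind, payload))

    def evidence(self):
        # Observations collected so far, in the order they were added.
        return [p for k, p in self.entries if k == "observation"]


def retrieve_clip(video, query):
    """Hypothetical retrieval tool: segments whose caption mentions the query."""
    return [(t, cap) for t, cap in video if query in cap]


def run_loop(video, instruction, queries, needed=2, max_steps=5):
    """Closed loop: reason, act, observe, then verify whether evidence suffices."""
    ctx = Context(instruction)
    for step, query in enumerate(queries[:max_steps]):
        ctx.add("reasoning", f"step {step}: look for '{query}'")
        ctx.add("tool_action", ("retrieve_clip", query))
        for hit in retrieve_clip(video, query):
            if hit not in ctx.evidence():  # accumulate only new evidence
                ctx.add("observation", hit)
        if len(ctx.evidence()) >= needed:  # verification step (assumed rule)
            ctx.add("memory", "enough evidence; stop")
            break
    return ctx


# Toy "video": (timestamp in seconds, caption) pairs, assumed for illustration.
video = [
    (0, "a person enters the kitchen"),
    (45, "the person chops onions"),
    (90, "the person fries onions in a pan"),
    (130, "the person serves the dish"),
]
ctx = run_loop(video, "What does the person do with the onions?",
               ["onions", "pan", "dish"])
for kind, payload in ctx.entries:
    print(kind, payload)
```

In a real system the next query would come from the model conditioned on the whole context, and the stopping rule would be a learned or prompted verification step rather than a fixed evidence count.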

Multimodal Models, Reasoning & Chain-of-Thought, Tool Use & Agents

Year: 2026

Venue: N/A
