Feb 22, 2026 · arXiv:2602.19146

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães

AI Summary

The paper introduces VIGiA, a multimodal dialogue model for reasoning about instructional videos by aligning user queries with task plans. VIGiA integrates multimodal plan reasoning and plan-based retrieval to provide grounded, plan-aware dialogue. Experiments on a new dataset of instructional video dialogues for cooking and DIY tasks demonstrate that VIGiA achieves over 90% accuracy on plan-aware visual question answering, outperforming existing state-of-the-art models.

Key Contribution

A multimodal dialogue model that reasons over the visual steps of instructional videos instead of relying on text-only guidance, and that outperforms existing state-of-the-art models on all evaluated tasks.

Abstract

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. We run experiments on a novel dataset of rich instructional video dialogues aligned with cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
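To make the idea of plan-based retrieval concrete, the following is a minimal sketch of matching a user query to the most relevant step of a textual plan. It is not the authors' method: VIGiA uses a learned multimodal model, whereas this toy version swaps in plain bag-of-words cosine similarity over text only. The function names (`tokenize`, `cosine`, `retrieve_step`) and the example plan are hypothetical and invented purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_step(query, plan_steps):
    """Return (index, step) of the plan step most similar to the query.

    Assumption: a lexical stand-in for a learned retriever, text only.
    """
    query_vec = Counter(tokenize(query))
    scores = [cosine(query_vec, Counter(tokenize(s))) for s in plan_steps]
    best = max(range(len(plan_steps)), key=scores.__getitem__)
    return best, plan_steps[best]


# Hypothetical cooking plan, invented for illustration only.
plan = [
    "Chop the onions and garlic.",
    "Heat olive oil in a large pan.",
    "Add the onions and cook until golden.",
    "Pour in the tomato sauce and simmer.",
]

index, step = retrieve_step("How long should I simmer the sauce?", plan)
print(index, step)
```

A retriever in the spirit of the abstract would replace the word-count vectors with learned embeddings of text or video frames, so the same ranking step could return either a textual or a visual representation of the plan step.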

Multimodal Models · Reasoning & Chain-of-Thought · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
