Qualcomm AI | Vector | York | Jun 8, 2026 | arXiv:2606.09547

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

AI Summary

This study evaluates whether video large language models (LLMs) can provide proactive task guidance in cooking scenarios. It introduces Ego-MC-Bench, a benchmark for reactive, step-by-step mistake correction, and finds it highly challenging for state-of-the-art models. The authors attribute this in part to the scarcity of training data that pairs mistakes with timely interventions. To address it, they create Ego-CoMist, a counterfactual synthetic dataset that transforms non-interactive cooking videos into examples of proactive interventions, and report that fine-tuning on it yields performance gains, especially for smaller, more efficient video LLMs suited to edge devices.

Key Contribution

Video LLMs struggle to correct mistakes in real time during cooking tasks; fine-tuning on a new synthetic dataset improves their performance, especially for smaller models.

Abstract

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
