Mar 16, 2026

arXiv:2603.15008

Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Kaixin Zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang

AI Summary

The paper introduces ClueNet, a clue-aware video reasoning framework designed to improve VideoQA performance by explicitly structuring reasoning between visual perception and answer derivation in MLLMs. ClueNet employs a two-stage supervised fine-tuning paradigm with decoupled supervision for clue extraction and chain-based reasoning, along with an adaptive clue filter for high-order reasoning. Experiments on NExT-QA, STAR, and MVBench demonstrate that ClueNet outperforms state-of-the-art methods, exhibiting enhanced generalization, hallucination mitigation, and inference efficiency.

Key Contribution

Stop MLLM hallucinations in VideoQA: ClueNet's two-stage training and adaptive clue filtering boost accuracy by at least 1.1% over state-of-the-art methods while improving interpretability and efficiency.

Abstract

Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
