SNU | University | USC | May 25, 2026 | arXiv:2605.26104

EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

Geo Ahn, Jiwook Han, Youngrae Kim, Joonseok Lee, Jinwoo Choi

AI Summary

This paper introduces EVIDENT, a parameter-efficient adaptation framework for Video Temporal Grounding (VTG) that improves cross-domain robustness by anchoring temporal grounding in the inherent entity-attention of pre-trained MLLMs. EVIDENT uses an Entity Bottleneck Adapter, Entity-Binding Distillation loss, and Entity-to-eVidence gating to guide the model to localize moments containing query-relevant entities. Experiments on cross-domain VTG benchmarks demonstrate that EVIDENT consistently improves out-of-domain robustness while maintaining in-domain performance.

Key Contribution

MLLMs struggle to generalize in Video Temporal Grounding not just due to unseen concepts, but because visual domain shift breaks their ability to link temporal localization with entity attention – a problem EVIDENT solves by explicitly routing adaptation through visual entity evidence.

Abstract

Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is driven not only by unseen query concepts but primarily by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.
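
To make the data flow in components (i) and (iii) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes plain cross-attention pooling as a stand-in for the Entity Bottleneck Adapter and a sigmoid over query-slot cosine similarity as a stand-in for the gating step. All names (`entity_slots`, `evidence_gate`, `slot_queries`, `tau`), the shapes, and the random inputs are hypothetical; the abstract does not specify any of them, and the distillation loss in (ii) is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entity_slots(tokens, slot_queries):
    """Pool dense visual tokens (N, D) into K slots (K, D).

    Assumption: simple cross-attention with learnable slot queries;
    the paper's adapter may differ.
    """
    d = tokens.shape[-1]
    attn = softmax(slot_queries @ tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ tokens  # each slot is a convex combination of tokens

def evidence_gate(slots, query_emb, tau=0.1):
    """Return a scalar in (0, 1) scoring how well the best slot matches the query.

    Assumption: cosine similarity with temperature tau, then a sigmoid.
    """
    s = slots / (np.linalg.norm(slots, axis=-1, keepdims=True) + 1e-8)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    best = (s @ q).max()
    return 1.0 / (1.0 + np.exp(-best / tau))

# Toy input: T frames, N visual tokens per frame, D channels, K slots.
rng = np.random.default_rng(0)
T, N, D, K = 4, 16, 8, 3
frames = rng.normal(size=(T, N, D))
slot_queries = rng.normal(size=(K, D))   # would be learned in practice
query_emb = rng.normal(size=D)           # stand-in for a text query embedding

# One gate value per frame; higher means more query-relevant entity evidence.
gates = np.array([evidence_gate(entity_slots(f, slot_queries), query_emb)
                  for f in frames])
print(gates.shape)
```

In this sketch the per-frame gate could be used to reweight frame features before temporal localization, so that frames without a query-matching slot contribute less. Whether EVIDENT applies its gate per frame, per token, or elsewhere is not stated in the abstract.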

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
