ZJU | Apr 28, 2026 | arXiv:2604.25618

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Zhaoyan Pan, Hengyang Zhou, Xiangdong Li, Yuning Wang, Ye Lou, Jiatong Pan, Ji Zhou, Wei Zhang

AI Summary

This paper introduces CUCI-Net, an architecture for conversational multimodal understanding that explicitly models the dependency between the dialogue context and the current utterance as an "interpretation cue." CUCI-Net keeps context and utterance structurally distinct during encoding, abstracts their dependency into this cue from both local modality evidence and global contextual evidence, and feeds the cue into the final multimodal interaction stage for context-conditioned prediction. The authors report that experiments on benchmark datasets demonstrate the effectiveness of CUCI-Net.

Key Contribution

CUCI-Net explicitly models the dependency between the dialogue context and the current utterance as an "interpretation cue" and uses that cue to condition the final multimodal interaction stage.

Abstract

Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on mainstream benchmark datasets fully demonstrate the effectiveness of the proposed method.
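The three stages named in the abstract (separate context and utterance encodings, a cue built from local and global evidence, and cue-conditioned interaction) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `interpretation_cue` and `cue_guided_fusion`, the dot-product attention, the tanh combination, and the softmax weighting over modalities are all assumptions chosen for illustration, and every modality is assumed to share one feature dimension.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def mean(vectors):
    return weighted_sum([1.0 / len(vectors)] * len(vectors), vectors)

def interpretation_cue(context, utterance):
    """Hypothetical cue builder.

    context:   {modality: [feature vector per preceding turn]}
    utterance: {modality: feature vector of the current utterance}
    Context and utterance stay in separate structures throughout.
    """
    local = []
    for modality, u in utterance.items():
        # Local modality evidence (assumed form): the utterance attends
        # over the context turns of the same modality.
        attn = softmax([dot(u, c) for c in context[modality]])
        attended = weighted_sum(attn, context[modality])
        local.append([a * b for a, b in zip(u, attended)])
    # Global contextual evidence (assumed form): mean over all turns
    # of all modalities.
    global_ev = mean([c for turns in context.values() for c in turns])
    local_ev = mean(local)
    return [math.tanh(l + g) for l, g in zip(local_ev, global_ev)]

def cue_guided_fusion(cue, utterance):
    """Hypothetical interaction stage: the cue weights each modality."""
    modalities = list(utterance)
    weights = softmax([dot(cue, utterance[m]) for m in modalities])
    fused = weighted_sum(weights, [utterance[m] for m in modalities])
    # Concatenate fused utterance features with the cue for a classifier.
    return fused + cue, dict(zip(modalities, weights))

# Made-up 3-dimensional features for two context turns and one utterance.
context = {
    "text":   [[0.2, 0.1, 0.0], [0.9, 0.4, 0.1]],
    "audio":  [[0.1, 0.3, 0.2], [0.0, 0.5, 0.7]],
    "vision": [[0.4, 0.0, 0.1], [0.3, 0.2, 0.6]],
}
utterance = {
    "text":   [0.8, 0.3, 0.2],
    "audio":  [0.1, 0.6, 0.5],
    "vision": [0.2, 0.1, 0.7],
}

cue = interpretation_cue(context, utterance)
features, modality_weights = cue_guided_fusion(cue, utterance)
```

In this sketch the cue is computed once from the context-utterance relation and then reused to decide how much each modality of the current utterance contributes, which is one simple way to make the prediction context-conditioned. A trained model would replace the fixed arithmetic with learned projections.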

Multimodal Models | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
