Tsinghua AI | ECNU | Hebei University of Science and Technology | Mar 4, 2026 | arXiv:2603.03827

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Yifan Wang, Songze Li, Hanlei Zhang

AI Summary

The paper introduces HIER, a novel approach for multimodal intent recognition that leverages hierarchical semantic representation and evolutionary reasoning within a Multimodal Large Language Model (MLLM). HIER organizes multimodal semantics into three levels: modality-specific tokens, clustered mid-level semantic concepts, and higher-order inter-concept relations selected using JS divergence. A self-evolution mechanism refines these representations through MLLM feedback, leading to state-of-the-art performance on three benchmarks with 1-3% gains.

Key Contribution

Achieves state-of-the-art multimodal intent recognition by structuring semantics into progressively abstracted levels and dynamically refining representations through MLLM feedback.

Abstract

Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens capturing localized semantic cues, which are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations are selected using Jensen-Shannon (JS) divergence scores to highlight salient dependencies across concepts. These hierarchical representations are then injected into the MLLM via chain-of-thought (CoT) prompting, enabling step-wise reasoning. In addition, HIER uses a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER consistently outperforms state-of-the-art methods and MLLMs, with 1-3% gains across all metrics. Code and more results are available at https://github.com/thuiar/HIER.
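
To make the relation-selection step more concrete, the following is a minimal sketch of scoring concept pairs by JS divergence and keeping the top-scoring ones. It is not taken from the HIER code base. The function names (`js_divergence`, `top_k_relations`), the representation of each concept as a discrete probability distribution, and the choice to keep the pairs with the highest scores are all assumptions made for illustration; the abstract does not specify how the distributions are derived or which end of the score range is selected.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def top_k_relations(concepts, k):
    """Rank all concept pairs by JS divergence and keep the k highest.

    `concepts` maps a hypothetical concept name to a probability
    distribution (a list of non-negative floats summing to 1).
    Keeping the highest scores is an assumption of this sketch.
    """
    scored = [
        (js_divergence(concepts[a], concepts[b]), a, b)
        for a, b in combinations(sorted(concepts), 2)
    ]
    scored.sort(reverse=True)
    return [(a, b) for _, a, b in scored[:k]]


if __name__ == "__main__":
    # Toy distributions over two intent labels (illustrative values only).
    concepts = {
        "a": [0.9, 0.1],
        "b": [0.1, 0.9],
        "c": [0.8, 0.2],
    }
    print(top_k_relations(concepts, k=2))
```

Because JS divergence is symmetric and bounded, scores for different concept pairs are directly comparable, which is what makes a simple top-k selection workable in a sketch like this.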

Multimodal Models, Natural Language Processing, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 54
Year: 2026
Venue: N/A
