Feb 16, 2026 | arXiv:2602.14425

Hierarchical Vision-Language Interaction for Facial Action Unit Detection

Yizhe Zhang, Wenhua Zhang, Tianyi Zhang, Muyun Jiang, Guo-Sen Xie, Cuntai Guan

AI Summary

The paper introduces Hierarchical Vision-language Interaction for AU Understanding (HiVA), a method that uses textual descriptions of facial action units (AUs), generated by a large language model, as semantic priors to improve AU detection, especially when annotated data is limited. HiVA uses an AU-aware dynamic graph module to learn AU-specific visual representations, and a hierarchical cross-modal attention architecture, comprising Disentangled Dual Cross-Attention (DDCA) and Contextual Dual Cross-Attention (CDCA), to capture both fine-grained and holistic vision-language associations. Experiments show that HiVA outperforms state-of-the-art methods and produces semantically meaningful activation patterns, indicating that it learns robust cross-modal correspondences.

Key Contribution

Injecting LLM-generated textual descriptions of facial action units into a vision model as semantic priors improves AU detection performance, suggesting a way to leverage language priors in computer vision when annotated data is limited.

Abstract

Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is learning discriminative and generalizable AU representations when annotated data is limited. To address this, we propose Hierarchical Vision-language Interaction for AU Understanding (HiVA), a method that leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are then integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to combine multi-grained vision-based AU features with refined language-based AU details, yielding robust and semantically enriched AU detection. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. In addition, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
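The abstract does not give the equations behind DDCA and CDCA, so the following is only a minimal NumPy sketch of one way a two-level dual cross-attention could be arranged, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the tensor shapes, the mean pooling, the residual connections, and the omission of learned projection weights. The fine-grained stage lets each AU's visual tokens and text tokens attend to each other in both directions, separately per AU; the contextual stage pools each AU to a single vector and attends across all AUs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: queries (n, d) attend over keys_values (m, d).
    # Learned query/key/value projections are omitted to keep the sketch short.
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)  # (n, m)
    return weights @ keys_values, weights

def dual_cross_attention(visual, text):
    # Attend in both directions (vision -> text and text -> vision),
    # adding each result back to its own modality as a residual.
    vis_from_text, _ = cross_attention(visual, text)
    text_from_vis, _ = cross_attention(text, visual)
    return visual + vis_from_text, text + text_from_vis

def fine_grained_interaction(visual_per_au, text_per_au):
    # Hypothetical AU-specific stage: each AU only sees its own description.
    # visual_per_au: (num_aus, n_v, d); text_per_au: (num_aus, n_t, d)
    pairs = [dual_cross_attention(v, t) for v, t in zip(visual_per_au, text_per_au)]
    return np.stack([p[0] for p in pairs]), np.stack([p[1] for p in pairs])

def contextual_interaction(visual_per_au, text_per_au):
    # Hypothetical global stage: pool each AU to one vector, then let
    # every AU attend to every other AU across modalities.
    v = visual_per_au.mean(axis=1)  # (num_aus, d)
    t = text_per_au.mean(axis=1)    # (num_aus, d)
    return dual_cross_attention(v, t)

# Random stand-ins for visual and text features (illustrative sizes only).
rng = np.random.default_rng(0)
num_aus, n_v, n_t, d = 12, 4, 6, 16
visual = rng.normal(size=(num_aus, n_v, d))
text = rng.normal(size=(num_aus, n_t, d))

fine_v, fine_t = fine_grained_interaction(visual, text)
ctx_v, ctx_t = contextual_interaction(fine_v, fine_t)
print(fine_v.shape, ctx_v.shape)
```

In this arrangement the per-AU stage keeps each AU's text prior from leaking into other AUs, while the pooled stage is where inter-AU dependencies can be expressed; whether HiVA orders or parameterizes the two stages this way is not stated in the abstract.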

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
