The paper introduces Hierarchical Vision-language Interaction for AU Understanding (HiVA), a method for facial action unit (AU) detection that uses textual AU descriptions generated by a large language model as semantic priors, targeting in particular the case of limited annotated data. HiVA uses an AU-aware dynamic graph module to learn AU-specific visual representations and a hierarchical cross-modal attention architecture to capture vision-language associations at two levels: Disentangled Dual Cross-Attention (DDCA) for fine-grained, AU-specific interactions and Contextual Dual Cross-Attention (CDCA) for holistic, inter-AU dependencies. The authors report that HiVA outperforms state-of-the-art methods and produces semantically meaningful activation patterns, which they take as evidence that it learns robust cross-modal correspondences.
Injecting LLM-generated textual descriptions of facial action units into a vision model improves AU detection over prior state-of-the-art methods, suggesting that language priors are a useful source of guidance for computer vision models.
Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is learning discriminative and generalizable AU representations when annotated data is limited. To address this, we propose Hierarchical Vision-language Interaction for AU Understanding (HiVA), a method that leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative cross-modal learning paradigm enables HiVA to combine multi-grained vision-based AU features with refined language-based AU details, yielding robust and semantically enriched AU detection. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. In addition, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
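The two-level attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the internals of DDCA or CDCA, so the code assumes plain single-head scaled dot-product attention with no learned projections, residual connections, mean pooling between the two stages, and arbitrary sizes (12 AUs, 32-dimensional features). The names `cross_attention` and `dual_cross_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections
    # (an assumption made to keep the sketch short).
    d = queries.shape[-1]
    scores = queries @ np.swapaxes(keys_values, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def dual_cross_attention(visual, text):
    # "Dual" here means both directions: vision attends to text and
    # text attends to vision, each added back as a residual.
    return (visual + cross_attention(visual, text),
            text + cross_attention(text, visual))

rng = np.random.default_rng(0)
num_aus, vis_tokens, txt_tokens, dim = 12, 4, 6, 32    # arbitrary sizes
visual = rng.normal(size=(num_aus, vis_tokens, dim))   # per-AU visual tokens
text = rng.normal(size=(num_aus, txt_tokens, dim))     # per-AU description tokens

# Fine-grained stage (in the spirit of DDCA): the AU axis acts as a batch
# axis, so each AU's visual tokens only see that AU's own description.
fine_v, fine_t = dual_cross_attention(visual, text)

# Contextual stage (in the spirit of CDCA): pool each AU to one vector,
# then let every AU attend over all AUs in the other modality.
ctx_v, ctx_t = dual_cross_attention(fine_v.mean(axis=1), fine_t.mean(axis=1))

print(fine_v.shape, fine_t.shape, ctx_v.shape, ctx_t.shape)
```

Batching the first stage over the AU axis keeps the AUs separated, while the second stage mixes information across AUs. A practical model would add learned query, key, and value projections, multiple heads, and normalization layers.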