OpenAI | Bilibili Inc. | College of Computer Science | College of Computer Science VCIP | School of Software Tiangong University | Mar 17, 2026 | arXiv:2603.16259

Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction

Baohang Zhou, Kehui Song, Rize Jin, Yu Zhao, Xuhui Sui, Xinying Qian, Xingyue Guo, Ying Zhang

AI Summary

This paper introduces Hyperbolic Multimodal Generative Representation Learning (HMGRL) to tackle generalized zero-shot multimodal information extraction (GZS-MIE), which requires extracting both seen and unseen entity types and relations from web text and images. HMGRL uses hyperbolic space to model hierarchical semantic relationships between modalities and categories, and it employs a variational information bottleneck and autoencoder network. By training with generated unseen samples and using a semantic similarity distribution alignment loss, HMGRL outperforms existing baselines on two benchmark datasets.

Key Contribution

Existing zero-shot multimodal information extraction models struggle with real-world scenarios that contain both seen and unseen categories. This work addresses the problem by modeling hierarchical semantic relationships in hyperbolic space and aligning the semantic similarity distributions of the seen and unseen category sets.

Abstract

Multimodal information extraction (MIE) comprises a set of essential tasks that extract structured information from Web text together with its accompanying images, supporting the construction of Web-based semantic knowledge. To handle an expanding category set that includes newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task, which aims to extract unseen structured knowledge from the textual and visual modalities. However, ZS-MIE models can only recognize samples that fall within the unseen category set, and they struggle with real-world scenarios that contain both seen and unseen categories. The shortcomings of existing methods stem from two main sources. First, these methods construct representations of samples and categories in Euclidean space, which fails to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. Second, there is a notable gap between the semantic similarity distributions of the seen and unseen category sets, which impairs the generative capability of ZS-MIE models. To overcome these shortcomings, we study the generalized zero-shot MIE (GZS-MIE) task and propose the hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed in hyperbolic space to model the multi-level hierarchical semantic correlations among samples and prototypes. Furthermore, the proposed model is trained with unseen samples generated by the decoder, and we introduce a semantic similarity distribution alignment loss to improve the model's generalization performance. Experimental evaluations on two benchmark datasets show that HMGRL outperforms existing baseline methods.
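
To give a feel for why hyperbolic space suits hierarchical structure, the following is a minimal sketch of distances in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model or curvature HMGRL uses, so the choice of the Poincaré ball, the curvature parameter `c`, and the helper names `expmap0` and `poincare_distance` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball.

    Assumption: curvature is -c with c > 0; this is an illustrative helper,
    not code from the paper.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball."""
    sq_norm = lambda u: sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    # Clamp to the domain of acosh to guard against rounding below 1.
    arg = max(1.0, 1.0 + 2.0 * c * diff / denom)
    return math.acosh(arg) / math.sqrt(c)


if __name__ == "__main__":
    # The same Euclidean gap of 0.1 costs more distance near the boundary.
    near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
    near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
    print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, so general concepts (for example, category prototypes) can sit near the origin while more specific items spread out near the edge. This is the property that tree-like hierarchies exploit, and it is not available in Euclidean space at comparable dimension.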

Computer Vision | Multimodal Models | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
