CUHK | HKU | Monash | Apr 13, 2026 | arXiv:2604.11804

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng

AI Summary

The paper introduces OmniShow, a framework for Human-Object Interaction Video Generation (HOIVG) conditioned on text, images, audio, and pose, addressing the limitations of existing methods in handling diverse multimodal inputs. OmniShow employs Unified Channel-wise Conditioning and Gated Local-Context Attention to balance controllability and quality, along with a Decoupled-Then-Joint Training strategy to mitigate data scarcity. The authors also introduce HOIVG-Bench, a new benchmark, and demonstrate state-of-the-art performance of OmniShow across various multimodal conditioning scenarios.

Key Contribution

OmniShow generates realistic human-object interaction videos conditioned jointly on text, reference images, audio, and pose, opening the door to automated content creation workflows.

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
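The abstract mentions model merging as part of the Decoupled-Then-Joint Training strategy but does not state the merge rule. As a rough illustration only, the sketch below shows one common form of model merging: linear interpolation of matching parameters from two models trained on different sub-tasks. The function name `merge_weights`, the `alpha` coefficient, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints: alpha * a + (1 - alpha) * b.

    Assumption: both checkpoints share one architecture, so their
    parameter names and shapes match exactly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")

    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


if __name__ == "__main__":
    # Hypothetical sub-task checkpoints (e.g. one per conditioning modality).
    model_a = {"layer.weight": [1.0, 3.0]}
    model_b = {"layer.weight": [3.0, 5.0]}
    print(merge_weights(model_a, model_b, alpha=0.5))
```

With `alpha=0.5` this reduces to plain weight averaging; the approach actually used in OmniShow may differ, and the merged model would normally be trained further in a joint stage.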

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
