CAS, Central South University, PKU, University of International Business and Economics, WeChat Lab

Apr 20, 2026

arXiv:2604.18326

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

Xing Cai, Yingjie Chen, Yiheng Li, Binxin Yang, Hao Liu, Chen Li

AI Summary

The authors introduce OmniHuman, a large-scale dataset of human-centric videos designed to address limitations in existing datasets regarding scene diversity, interaction modeling, and attribute alignment. They also present a fully automated pipeline for data collection and multi-modal annotation to generate this dataset. To evaluate models trained on OmniHuman, they introduce the OmniHuman Benchmark (OHBench), a three-level evaluation system with metrics aligned with human perception.

Key Contribution

Existing video datasets fail to capture the complexity of human interactions in diverse scenes. OmniHuman provides a dataset for training, and OHBench a benchmark for evaluating, more realistic human-centric video generation.

Abstract

Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
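The three annotation levels named in the abstract (video-level scenes, frame-level interactions, individual-level attributes) could be pictured as a nested record. The sketch below is only an illustration: the abstract does not publish a schema, so every class name, field name, and value here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# NOTE: all class and field names are hypothetical. The abstract names the
# three annotation levels but does not specify a schema.


@dataclass
class IndividualAttributes:
    """Individual-level annotation for one person in the video."""
    person_id: int
    description: str  # free-text attribute description (assumed format)


@dataclass
class FrameInteraction:
    """Frame-level annotation of one interaction."""
    frame_index: int
    subject_id: int   # person_id of the person acting
    target: str       # another person's id or an object label (assumed, as text)
    kind: str         # "person-person" or "person-object"


@dataclass
class VideoAnnotation:
    """Video-level annotation holding the two finer levels."""
    scene: str
    camera: str
    interactions: List[FrameInteraction] = field(default_factory=list)
    individuals: List[IndividualAttributes] = field(default_factory=list)

    def interactions_of(self, person_id: int) -> List[FrameInteraction]:
        """Return the interactions in which the given person is the subject."""
        return [i for i in self.interactions if i.subject_id == person_id]
```

Under these assumptions, a single `VideoAnnotation` carries one scene and camera label, while the per-frame and per-person records hang off it as lists, which keeps the three levels separately queryable.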

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
