Microsoft Research, SNU, University of Science and Technology

Jun 10, 2026

arXiv:2606.11783

A Comprehensive Ecosystem for Open-Domain Customized Video Generation

Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, Chong Luo

AI Summary

This paper addresses the limitations of open-domain customized video generation by introducing PexelsCustom-1M, a large-scale dataset of one million identity-specific video triplets across over 8,000 categories. Utilizing this dataset, the authors present CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer, achieving state-of-the-art performance with only 8% additional learnable parameters. Furthermore, they establish OpenCustom, a benchmark with over 1,000 categories, demonstrating the effectiveness of their approach through extensive experiments.

Key Contribution

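A million-scale dataset for identity-preserving video generation, a parameter-efficient model (CustoMDiT) that surpasses the prior state of the art with only 8% additional learnable parameters, and a new 1,000+ category benchmark (OpenCustom) for evaluating open-domain customization.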

Abstract

Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated <identity, text, video> triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem--including dataset, pipeline, benchmark, and implementations--to support further research.
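To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch, not the authors' code. It shows one possible in-memory shape for an <identity, text, video> triplet and how a parameter-overhead figure such as "8% additional learnable parameters" could be computed. The class name `CustomTriplet`, its field names, the file paths, and the parameter counts are all hypothetical. The ratio is assumed to be measured against the frozen base model, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomTriplet:
    """Hypothetical record for one <identity, text, video> triplet."""
    identity_image: str  # path to a reference image of the subject (assumed field)
    text: str            # caption describing the target video
    video: str           # path to the video clip (assumed field)
    category: str        # one of the dataset's categories (assumed field)


def added_parameter_ratio(base_params: int, added_params: int) -> float:
    """Return added learnable parameters as a fraction of the base model size.

    Assumption: the overhead is measured relative to the frozen base model,
    not relative to the combined total.
    """
    if base_params <= 0:
        raise ValueError("base_params must be positive")
    if added_params < 0:
        raise ValueError("added_params must be non-negative")
    return added_params / base_params


# Illustrative values only; none of these come from the paper.
sample = CustomTriplet(
    identity_image="identities/dog_0001.jpg",
    text="a corgi running along a beach at sunset",
    video="videos/dog_0001.mp4",
    category="dog",
)
ratio = added_parameter_ratio(base_params=1_000_000_000, added_params=80_000_000)
print(f"{sample.category}: {ratio:.1%} additional learnable parameters")
```

With the illustrative counts above, the ratio comes out to 0.08. If the overhead were instead measured against the combined total, the same counts would give a slightly smaller figure, so the choice of denominator matters when comparing such numbers across papers.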

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
