ByteDance, Jilin, JIUTIAN Research, NJU
Apr 30, 2026
arXiv:2604.27958

TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Dingbao Shao, Song Wu, Shenyi Wang, Ye Wang, Ziheng Tang, Fei Liu, Jiang Lin, Xinyu Chen, Qian Wang, Ying Tai, Jian Yang, Zili Yi

AI Summary

The authors introduce TripVVT-10K, a large-scale in-the-wild triplet dataset for video virtual try-on, addressing the lack of diverse training data with explicit video-level cross-garment supervision. They then propose TripVVT, a Diffusion Transformer-based framework that uses a stable human-mask prior instead of fragile garment masks for improved background preservation and robustness. Experiments on the newly established TripVVT-Bench demonstrate that TripVVT achieves superior video quality, garment fidelity, and generalization compared to existing state-of-the-art methods.

Key Contribution

Ditching fragile garment masks for a simple human-mask prior unlocks surprisingly robust and realistic video virtual try-on, even in cluttered, in-the-wild scenarios.

Abstract

Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce **TripVVT-10K**, the largest and most diverse in-the-wild triplet dataset to date, providing explicit video-level cross-garment supervision that existing video datasets lack. Built upon this resource, we develop **TripVVT**, a Diffusion Transformer-based framework that replaces fragile garment masks with a simple, stable human-mask prior, enabling reliable background preservation while remaining robust to real-world motion, occlusion, and cluttered scenes. To support comprehensive evaluation, we further establish **TripVVT-Bench**, a 100-case benchmark covering diverse garments, complex environments, and multi-person scenarios, with metrics spanning video quality, try-on fidelity, background consistency, and temporal coherence. Compared to state-of-the-art academic and commercial systems, TripVVT achieves superior video quality and garment fidelity while markedly improving generalization to challenging in-the-wild videos. We publicly release the dataset and benchmark, which we believe provide a solid foundation for advancing controllable, realistic, and temporally stable video virtual try-on.
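The abstract does not say how the human-mask prior is applied inside the model. As a rough illustration only, a coarse human mask can preserve the background by keeping every pixel outside the mask from the source video and taking pixels inside the mask from the generated video. The sketch below assumes this simple compositing scheme; the function names `composite_frame` and `composite_video` are hypothetical and not taken from the paper, and plain nested lists stand in for image arrays.

```python
# Minimal sketch (an assumption, not the authors' method): background
# preservation by compositing with a coarse binary human mask.
# A frame is a list of rows; each row is a list of pixel values.


def composite_frame(source, generated, human_mask):
    """Take generated pixels where the mask is truthy, source pixels elsewhere."""
    return [
        [gen_px if mask_px else src_px
         for src_px, gen_px, mask_px in zip(src_row, gen_row, mask_row)]
        for src_row, gen_row, mask_row in zip(source, generated, human_mask)
    ]


def composite_video(source_frames, generated_frames, mask_frames):
    """Apply the per-frame compositing to every frame of a video."""
    return [
        composite_frame(src, gen, mask)
        for src, gen, mask in zip(source_frames, generated_frames, mask_frames)
    ]


if __name__ == "__main__":
    source = [[1, 2], [3, 4]]      # original frame (background to preserve)
    generated = [[9, 9], [9, 9]]   # try-on output for the same frame
    mask = [[0, 1], [0, 0]]        # 1 marks the (coarse) human region
    print(composite_frame(source, generated, mask))
```

Because the mask only has to cover the person rather than trace the garment boundary, errors in garment segmentation under motion or occlusion do not leak into the preserved background, which is the robustness argument the abstract makes.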

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
