Microsoft Research; Xiamen University | Mar 10, 2026 | arXiv:2603.09236

BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off

Shuang Liu, Ao Yu, Linkang Cheng, Xiwen Huang, Li Zhao, Junhui Liu, Zhiting Lin

AI Summary

The paper introduces BridgeDiff, a diffusion-based framework for virtual try-off (VTOFF) that bridges the gap between human-centric observations and flat-garment synthesis. BridgeDiff incorporates a Garment Condition Bridge Module (GCBM) to capture global garment appearance and semantic identity, and a Flat Structure Constraint Module (FSCM) with Flat-Constraint Attention (FC-Attention) to inject flat-garment structural priors during denoising. Experiments on VTOFF benchmarks demonstrate state-of-the-art performance in generating high-quality flat-garment reconstructions with improved appearance and structural integrity.

Key Contribution

By explicitly bridging the gap between on-body appearances and flat layouts, BridgeDiff achieves state-of-the-art virtual try-off results, generating more realistic and structurally sound flat-garment representations.

Abstract

Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
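The abstract says FC-Attention injects structural priors only "at selected denoising stages" but gives no implementation details. As a rough illustration of that gating idea, the sketch below applies plain scaled dot-product cross-attention from latent tokens to prior tokens and adds the result residually, but only on chosen steps. Everything here is an assumption made for illustration: the function names, the `active_steps` set, the `scale` parameter, and the residual form are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def gated_prior_attention(latent_tokens, prior_tokens, step, active_steps, scale=1.0):
    """Hypothetical stand-in for step-gated prior injection (not the paper's FC-Attention).

    On steps listed in `active_steps`, latent tokens attend to the structural
    prior tokens and the attended result is added residually. On all other
    steps the latent tokens pass through unchanged.
    """
    if step not in active_steps:
        return [row[:] for row in latent_tokens]
    attended = cross_attention(latent_tokens, prior_tokens, prior_tokens)
    return [
        [x + scale * a for x, a in zip(row, att)]
        for row, att in zip(latent_tokens, attended)
    ]


# Toy usage: two latent tokens, one prior token, prior injected only at step 10.
latents = [[1.0, 0.0], [0.0, 1.0]]
prior = [[1.0, 1.0]]
print(gated_prior_attention(latents, prior, step=10, active_steps={10, 20}))
print(gated_prior_attention(latents, prior, step=11, active_steps={10, 20}))
```

A real diffusion backbone would use learned query, key, and value projections and batched tensors; the pure-Python lists here only keep the gating logic easy to follow.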

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
