The paper introduces CORAL, a Diffusion Transformer (DiT)-based framework for unpaired virtual try-on (VTON) that explicitly enforces person-garment alignment by improving query-key matching within the full 3D attention mechanism. The authors show that precise person-garment query-key matching is critical for correspondence in DiT-based VTON, and they address the lack of explicit alignment in existing methods. CORAL improves global shape transfer and local detail preservation through a correspondence distillation loss and an entropy minimization loss, and the results are evaluated with a new VLM-based evaluation protocol.
By explicitly aligning attention with external correspondences, CORAL significantly improves detail preservation in virtual try-on, addressing a key limitation of existing Diffusion Transformer methods.
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architectures and reveal that person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
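To make the two loss terms in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the formulas, so the cross-entropy form of the distillation term, the mean row entropy form of the sharpening term, and every function and variable name below are assumptions chosen for illustration. The sketch treats person tokens as queries and garment tokens as keys, and takes the external correspondences as a mapping from person token index to garment token index that contains only matches considered reliable.

```python
import math

EPS = 1e-12  # guards against log(0)


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def person_to_garment_attention(queries, keys):
    """Scaled dot-product attention weights, one row per person query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def correspondence_distillation_loss(attn, matches):
    """Assumed form: cross-entropy that pushes each matched person query
    to put its attention mass on the externally matched garment key."""
    if not matches:
        return 0.0
    total = sum(-math.log(attn[p][g] + EPS) for p, g in matches.items())
    return total / len(matches)


def entropy_loss(attn):
    """Assumed form: mean Shannon entropy of the attention rows.
    Minimizing it sharpens each row toward a single garment key."""
    row_entropies = [-sum(w * math.log(w + EPS) for w in row) for row in attn]
    return sum(row_entropies) / len(attn)


# Toy example: two person queries, two garment keys (hypothetical values).
garment_keys = [[1.0, 0.0], [0.0, 1.0]]
person_queries = [[8.0, 0.0], [0.0, 8.0]]
reliable_matches = {0: 0, 1: 1}  # person token index -> garment token index

attn = person_to_garment_attention(person_queries, garment_keys)
print(correspondence_distillation_loss(attn, reliable_matches))
print(entropy_loss(attn))
```

In this toy setup both terms shrink as each person query concentrates its attention on its matched garment key. In a training loop the two terms would presumably be weighted and added to the diffusion objective, but the weighting and the layers they apply to are not stated in the abstract.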