This paper introduces CustomShift, a dual-branch architecture built on Stable Diffusion 3 for subject-driven image customization. It integrates reference images into text-to-image generation through a Conditional Attention Distribution Shift formulation. A Reference-Alignment Branch uses self-attention between reference images and subject names, and a Cross-Guidance Branch combines textual and reference cues, addressing the inefficiency and feature misalignment of existing techniques. On the DreamBooth and Custom101 benchmarks, CustomShift outperforms state-of-the-art methods and achieves a better balance between semantic fidelity and subject consistency.
CustomShift achieves a better balance between semantic fidelity and subject consistency than existing subject-driven customization methods by treating reference-image conditioning as a shift in the attention distribution.
Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.
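The abstract does not spell out the maximum-entropy derivation, so the following is only a minimal sketch of the general idea and not the paper's actual formulation. It relies on a standard result: among all distributions that reach a given expected score, the one closest in KL divergence to a base distribution is an exponential tilt of that base, which for softmax attention amounts to adding a bias to the logits. The names `base_logits`, `ref_scores`, `lam`, and `shifted_attention` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shifted_attention(base_logits, ref_scores, lam):
    """Exponentially tilt softmax(base_logits) by ref_scores.

    Assumption (not from the paper): the shifted weights are
    q_i proportional to p_i * exp(lam * r_i), i.e. a softmax over
    base_logits + lam * ref_scores. lam = 0 recovers the base weights.
    """
    return softmax([b + lam * r for b, r in zip(base_logits, ref_scores)])


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Toy example: four attention keys, the last two favoured by the reference.
base_logits = [2.0, 1.0, 0.5, 0.0]
ref_scores = [0.0, 0.0, 1.0, 2.0]

p = softmax(base_logits)
q = shifted_attention(base_logits, ref_scores, lam=1.0)

print("base weights:   ", [round(x, 3) for x in p])
print("shifted weights:", [round(x, 3) for x in q])
print("KL(q || p):     ", round(kl_divergence(q, p), 3))
```

In this toy setting, `lam` controls how far the attention weights move toward the reference-favoured keys, and the KL term measures how far the shifted distribution has drifted from the text-only one.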