Westlake, ZJU | Jun 1, 2026 | arXiv:2606.02129

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

AI Summary

This paper introduces Equilibrated Diffusion, a frequency-aware approach to image customization that disentangles concept features to improve text-visual matching. By optimizing low- and high-frequency embeddings independently, the method separates subject identity from stylistic elements and generalizes better to unseen stylistic prompts. The authors report that Equilibrated Diffusion outperforms existing methods on subject fidelity and adherence to text prompts.

Key Contribution

Optimizing separate low-frequency (subject content) and high-frequency (style) textual embeddings disentangles style from content, improving customization fidelity and text alignment.

Abstract

Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method's superiority.
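The abstract's premise, that low frequencies carry subject content and high frequencies carry style, can be illustrated with a generic Fourier-domain split. The sketch below is not the paper's implementation: the abstract does not specify how concepts are decomposed, so the function name `split_frequencies`, the `cutoff` parameter, and the circular low-pass mask are all assumptions chosen for illustration. It shows only how a grayscale image can be separated into a low-frequency part and a high-frequency residual that sum back to the original.

```python
import numpy as np


def split_frequencies(image, cutoff=0.25):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` is a hypothetical parameter: the low-pass radius as a
    fraction of the Nyquist frequency (0.5 cycles per pixel).
    """
    h, w = image.shape
    # Per-axis frequencies in cycles per pixel, in the range [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Circular low-pass mask (an assumption; the paper may filter differently).
    low_mask = radius <= cutoff * 0.5

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real  # coarse structure
    high = image - low                            # fine detail / texture
    return low, high


# Usage on a synthetic image: the two parts sum back to the input.
rng = np.random.default_rng(0)
example = rng.random((32, 32))
low_part, high_part = split_frequencies(example, cutoff=0.25)
reconstructed = low_part + high_part
```

Defining the high-frequency part as the residual `image - low` keeps the decomposition lossless, so nothing is discarded if the two parts are processed separately and later merged. In the paper's setting, each band would presumably supervise its own textual embedding; that training step is outside the scope of this sketch.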

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
