Qualcomm AI, University of Technology Nuremberg, UvA | May 24, 2026 | arXiv:2605.25191

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik

AI Summary

The paper introduces Visual Concept Fusion (VCF), a novel inference-time method for injecting visual guidance into text-to-image diffusion models like Stable Diffusion without retraining. VCF aligns CLIP image features with the text embedding space using a lightweight aligner trained with InfoNCE and cross-attention reconstruction losses, enabling the transfer of visual attributes from reference images. Experiments demonstrate VCF's ability to transfer style, composition, and color palette while maintaining prompt adherence, achieving superior reference fidelity compared to baselines.

Key Contribution

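VCF injects visual style, composition, and color palette from reference images into Stable Diffusion at inference time, *without retraining the diffusion model*; only a lightweight aligner is trained, and no concept-specific training is required.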

Abstract

Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with textual prompts. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
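The abstract names InfoNCE as one of the two losses used to train the aligner. As a rough illustration of what that contrastive objective computes, here is a minimal pure-Python sketch. It is not the authors' implementation: the symmetric (image-to-text plus text-to-image) form, the temperature of 0.07, and the function name `info_nce` are all assumptions made for this example, and the cross-attention reconstruction loss is not covered.

```python
import math


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (illustrative sketch).

    image_embs[i] and text_embs[i] are assumed to form a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an assumption, not a value from the paper.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], using the log-sum-exp trick for stability
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose the logits to score the text-to-image direction
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))
```

The loss is small when each image embedding is closest to its own paired text embedding and grows when pairs are mismatched, which is the pressure that would pull mapped image tokens toward the text embedding space.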

Computer Vision, Multimodal Models


Year: 2026

Venue: N/A
