German Research Center for AI; Technical; Rutgers
Oct 14, 2025
arXiv:2510.12789

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale

AI Summary

The paper introduces UniFusion, a diffusion-based generative model that uses a frozen vision-language model (VLM) as a unified encoder for both text and images, addressing the limitations of separate encoders in cross-modal reasoning. UniFusion employs a Layerwise Attention Pooling (LAP) mechanism to extract both high-level semantics and low-level details from the VLM's text and visual tokens to condition the diffusion model. The proposed VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI) further enhances the model's reasoning capabilities and flexibility by conditioning the diffusion transformer on text tokens generated by the VLM during in-model prompt rewriting.

Key Contribution

Ditch separate image and text encoders: UniFusion uses a single frozen VLM to generate and edit images, achieving better text-image alignment and zero-shot generalization.

Abstract

Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting their accessibility.

We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities, giving greater capability and flexibility at inference. In addition, fine-tuning on an editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits strong generalization capabilities. Our model, when trained on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.
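To make the two ideas in the abstract more concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation: the abstract does not specify how LAP is built, so the single learned query, the key projection `w_key`, the softmax over the layer axis, and the helper names `layerwise_attention_pool` and `select_generated_tokens` are all hypothetical choices made for illustration. The sketch pools each token's hidden states across VLM layers with attention weights, then keeps only the positions that correspond to VLM-generated (rewritten) tokens, in the spirit of VERIFI.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layerwise_attention_pool(hidden_states, query, w_key):
    """Pool per-token features across layers (illustrative assumption only).

    hidden_states: (num_layers, num_tokens, dim) activations from a frozen VLM.
    query:         (dim,) hypothetical learned query vector.
    w_key:         (dim, dim) hypothetical learned key projection.
    Returns pooled features (num_tokens, dim) and weights (num_layers, num_tokens).
    """
    _, _, dim = hidden_states.shape
    keys = hidden_states @ w_key              # (num_layers, num_tokens, dim)
    scores = keys @ query / np.sqrt(dim)      # (num_layers, num_tokens)
    weights = softmax(scores, axis=0)         # normalize over layers, per token
    pooled = (weights[..., None] * hidden_states).sum(axis=0)
    return pooled, weights


def select_generated_tokens(pooled, prompt_len):
    """Keep only positions after the original prompt, i.e. the tokens the VLM
    generated while rewriting (an assumed sequence layout: prompt first)."""
    return pooled[prompt_len:]


# Toy stand-in for frozen VLM activations: 4 layers, 6 tokens, 8 channels.
rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 4, 6, 8
hidden_states = rng.normal(size=(num_layers, num_tokens, dim))
query = rng.normal(size=dim)
w_key = rng.normal(size=(dim, dim)) / np.sqrt(dim)

pooled, weights = layerwise_attention_pool(hidden_states, query, w_key)
conditioning = select_generated_tokens(pooled, prompt_len=2)
```

In this sketch, `conditioning` is what a diffusion transformer would receive as its conditioning sequence. Because the weights are computed per token, different tokens can draw on different depths of the VLM, which is one plausible way to mix low-level detail from early layers with high-level semantics from late layers.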

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Multimodal Models

Citation Metrics

Citations: 3
Influential citations: 1
References: 44
Year: 2025
Venue: arXiv.org
