ByteDance, HKU, NJU | Jun 15, 2026 | arXiv:2606.16255

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang

AI Summary

This paper introduces UniDDT, a framework that unifies multimodal understanding and generation by combining a Noisy ViT encoder with a decoupled diffusion decoder. It targets the learning conflicts and scalability issues of existing Unified Multimodal Models (UMMs), aiming for better semantic consistency across visual generation and understanding. The paper reports a GenEval score of 0.87 and a DPG overall score of 86.9 for generation, and an MME score of 1699.5 and a SEEDbench overall score of 76.5 for understanding.

Key Contribution

UniDDT unifies semantic encoding for visual understanding and generation through a Noisy ViT encoder and an LLM, decouples diffusion decoding from text decoding with a separate diffusion decoder, and trains on dual data built from the same image-text pairs.

Abstract

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) separate visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data, which neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding, and employs a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT uses the latent space as a unified visual representation, making understanding and generation tasks compatible. This balances scalability in generation with semantic expressiveness in understanding. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. For multimodal understanding, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
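The dual data idea in the abstract can be illustrated with a small sketch. The abstract does not describe the actual data format, so everything below is an assumption made for illustration: the `Sample` record, the `make_dual_samples` helper, the field names, and the default prompt are all hypothetical. The sketch only shows the general pattern of turning one image-text pair into a generation sample (text in, image out) and an understanding sample (image in, text out) that share the same image and caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record; not the paper's actual schema."""
    task: str     # "generation" or "understanding"
    inputs: dict  # what the model conditions on
    target: dict  # what the model is trained to produce


def make_dual_samples(image_id: str, caption: str,
                      prompt: str = "Describe this image."):
    """Build one generation and one understanding sample from a single pair.

    Assumption: the caption serves as the text condition for generation
    and as the text target for understanding.
    """
    generation = Sample(
        task="generation",
        inputs={"text": caption},
        target={"image": image_id},
    )
    understanding = Sample(
        task="understanding",
        inputs={"image": image_id, "text": prompt},
        target={"text": caption},
    )
    return generation, understanding


gen, und = make_dual_samples("img_0001", "a red cube on a blue sphere")
print(gen.task, "->", gen.target)
print(und.task, "->", und.target)
```

In this sketch the image that one sample produces is the image the other consumes, and the same holds for the caption, which is one simple way to make the two tasks' data interdependent.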

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
