AGI Research Center, Imperial, Inclusion AI, Westlake, ZJU | Apr 22, 2026 | arXiv:2604.20796

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Inclusion AI, Tiwei Bie, Hao Chen, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kailei Gan, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Qiujieli Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Hao Wu, Yi Xin, Junbo Zhao, Jun Zhao

AI Summary

LLaDA2.0-Uni is a unified discrete diffusion large language model (dLLM) that natively integrates multimodal understanding and generation. It employs a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder, discretizing visual inputs via SigLIP-VQ to enable block-level masked diffusion for both text and vision. The model achieves performance comparable to specialized VLMs in multimodal understanding and excels in image generation and editing, while also supporting interleaved generation and reasoning.

Key Contribution

A single model now rivals specialized vision-language models in understanding, while also generating and editing images, thanks to a unified discrete diffusion framework.

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision tokens within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
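
To make the phrase "block-level masked diffusion" more concrete, the following is a minimal, generic sketch of block-wise masked decoding over discrete tokens. It is not taken from the LLaDA2.0-Uni code base. The names `MASK`, `toy_denoiser`, and `block_masked_decode` are hypothetical, the random denoiser merely stands in for the MoE backbone, and the confidence-based unmasking schedule is an assumption, since the abstract does not specify one.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the dLLM backbone (assumption).

    Returns one (predicted_token, confidence) pair per position.
    Already-decoded positions are returned unchanged with confidence 1.0.
    """
    preds = []
    for t in tokens:
        if t == MASK:
            preds.append((rng.randrange(vocab_size), rng.random()))
        else:
            preds.append((t, 1.0))
    return preds


def block_masked_decode(prompt, num_blocks, block_size, steps_per_block,
                        vocab_size=100, seed=0):
    """Decode block by block; within a block, unmask tokens in parallel."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(num_blocks):
        start = len(seq)
        # Append a fully masked block after the already-decoded prefix.
        seq.extend([MASK] * block_size)
        for step in range(steps_per_block):
            masked = [i for i in range(start, len(seq)) if seq[i] == MASK]
            if not masked:
                break
            preds = toy_denoiser(seq, vocab_size, rng)
            # Spread the remaining masked positions over the remaining steps
            # (ceiling division), so the last step clears the block.
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)
            # Commit the k most confident predictions in this step.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:k]:
                seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    out = block_masked_decode([1, 2, 3], num_blocks=2, block_size=4,
                              steps_per_block=2)
    print(len(out), out[:3])
```

In this sketch the prefix is left untouched while each new block is filled in over a small number of parallel steps. In a unified model the same loop would run over text tokens and discretized visual tokens alike, with a separate decoder turning the visual tokens back into pixels.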

Architecture Design (Transformers, SSMs, MoE), Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 101

Year: 2026

Venue: N/A
