The paper introduces OpenUni, a unified multimodal model for both understanding and generation, built by connecting pre-trained multimodal LLMs and diffusion models using learnable queries and a lightweight transformer connector. This approach simplifies training and reduces overhead, achieving strong performance on image generation and multimodal benchmarks with only 1.1B or 3.1B activated parameters. The authors release model weights, training code, and a 23M image-text pair dataset to facilitate open research.
Achieve surprisingly strong multimodal understanding and generation with a simple connector between off-the-shelf LLMs and diffusion models, using only a fraction of the parameters of larger models.
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With this minimalist architecture, we demonstrate that OpenUni can: 1) generate high-quality, instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
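The bridging idea described above can be sketched in a few lines: a fixed set of learnable queries cross-attends to the frozen MLLM's hidden states, and the attended outputs serve as conditioning tokens for the diffusion model. The sketch below is a minimal single-layer illustration in NumPy; all dimensions, the single-head attention, and the weight initialization are illustrative assumptions, not the released OpenUni configuration (which uses a lightweight multi-layer transformer connector).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def connector_step(queries, llm_hidden, Wq, Wk, Wv):
    """One cross-attention step of a hypothetical connector:
    learnable queries attend to the frozen MLLM's hidden states,
    yielding conditioning tokens for the diffusion model."""
    Q = queries @ Wq           # (num_queries, d)
    K = llm_hidden @ Wk        # (seq_len, d)
    V = llm_hidden @ Wv        # (seq_len, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (num_queries, seq_len)
    return attn @ V            # (num_queries, d) conditioning tokens

rng = np.random.default_rng(0)
d = 64                                        # illustrative hidden size
queries = rng.normal(size=(16, d)) * 0.02     # learnable queries (at init)
hidden = rng.normal(size=(77, d))             # MLLM hidden states for one prompt
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

cond = connector_step(queries, hidden, Wq, Wk, Wv)
print(cond.shape)  # (16, 64)
```

Because only the queries and the small connector are trained while the MLLM and diffusion model stay frozen (per the report's efficient training strategy), the number of activated parameters stays small relative to end-to-end unified models.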