Cheers, a unified multimodal model, decouples patch-level details from semantic representations to improve both visual understanding and image generation. It uses a unified vision tokenizer for efficient LLM conditioning, an LLM-based Transformer that handles both autoregressive and diffusion decoding, and a cascaded flow matching head that injects semantically gated detail residuals. Cheers achieves state-of-the-art performance on visual understanding and generation benchmarks while also delivering 4x token compression, significantly reducing training costs compared to models such as Tar-1.5B.
By decoupling patch details from semantics, Cheers achieves state-of-the-art multimodal understanding and generation with 4x token compression and only 20% of the training cost of comparable models.
A recent frontier in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them non-trivial to optimize jointly within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning; (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation; and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the GenEval and MMBench benchmarks while requiring only 20% of the training cost, demonstrating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
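The cascaded decoding idea above — reconstruct semantics first, then refine with detail residuals whose contribution is gated by the semantics themselves — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the identity stand-in for the flow-matching decoder, the single scalar gate per token, and all names (`cascaded_decode`, `W_gate`, `b_gate`) are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cascaded_decode(semantic_tokens, detail_residuals, W_gate, b_gate):
    """Sketch of a cascaded head with semantically gated detail residuals.
    semantic_tokens:  (T, D) semantic tokens from the vision tokenizer
    detail_residuals: (T, D) patch-level detail residuals
    W_gate, b_gate:   hypothetical gate parameters (one scalar gate per token)
    """
    # Stage 1: coarse reconstruction from semantics (identity stand-in for
    # the flow-matching decoder described in the abstract).
    coarse = semantic_tokens
    # Stage 2: the gate is conditioned on the semantics, so detail injection
    # is "semantically gated"; gate values lie in (0, 1).
    gate = sigmoid(semantic_tokens @ W_gate + b_gate)   # (T, 1)
    refined = coarse + gate * detail_residuals           # broadcast over D
    return refined

rng = np.random.default_rng(0)
T, D = 16, 8                                 # tokens, feature dim
sem = rng.standard_normal((T, D))
det = rng.standard_normal((T, D))
W = rng.standard_normal((D, 1))
b = np.zeros(1)
out = cascaded_decode(sem, det, W, b)
print(out.shape)                             # (16, 8)
```

Because the gate is bounded in (0, 1), the detail residual can never overwrite the semantic reconstruction — it only nudges it, which is one way to read the paper's claim that decoupling stabilizes semantics while residuals recover high-frequency content.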