UniG2U-Bench is introduced as a benchmark spanning 7 regimes and 30 subtasks to evaluate whether the generative capabilities of unified multimodal models improve their understanding. Evaluation of over 30 models reveals that unified models often underperform their base VLMs, and that Generate-then-Answer (GtA) inference typically degrades performance, except on tasks involving spatial intelligence, visual illusions, or multi-round reasoning. The study also finds that tasks with similar reasoning structures and models with shared architectures behave in correlated ways, indicating that generation-understanding coupling induces class-consistent inductive biases.
Coupling generation with understanding in unified multimodal models often *hurts* performance on multimodal understanding tasks, except for spatial reasoning, visual illusions, and multi-round reasoning, challenging the assumption that generation universally improves understanding.
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models that share architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
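To make the two evaluation protocols concrete, the sketch below contrasts direct inference with Generate-then-Answer (GtA) inference as described in the abstract. It is a minimal illustration, not the benchmark's actual harness: `UnifiedModel`, `answer`, `generate_image`, and the `rounds` parameter are hypothetical names assumed for exposition.

```python
# Hypothetical sketch of the two inference modes compared by UniG2U-Bench.
# All class and method names here are illustrative assumptions, not the
# benchmark's real API.

from dataclasses import dataclass
from typing import Any, List

Image = Any  # stand-in for whatever image type a real model consumes


@dataclass
class Sample:
    question: str
    images: List[Image]  # input image(s) for the understanding task


class UnifiedModel:
    """Placeholder for a unified multimodal model that both understands and generates images."""

    def answer(self, question: str, images: List[Image]) -> str:
        """Return a textual answer conditioned on the question and images."""
        raise NotImplementedError

    def generate_image(self, prompt: str, images: List[Image]) -> Image:
        """Produce an intermediate image (e.g., an imagined or transformed view)."""
        raise NotImplementedError


def direct_inference(model: UnifiedModel, sample: Sample) -> str:
    # Baseline protocol: the model answers directly from the original inputs.
    return model.answer(sample.question, sample.images)


def generate_then_answer(model: UnifiedModel, sample: Sample, rounds: int = 1) -> str:
    # Generate-then-Answer (GtA): the model first renders one or more
    # intermediate images that make the required visual transformation
    # explicit, then answers conditioned on both the original and the
    # generated images. Multiple rounds mirror the multi-step intermediate
    # image states that benefit multi-round reasoning subtasks.
    images = list(sample.images)
    for _ in range(rounds):
        intermediate = model.generate_image(sample.question, images)
        images.append(intermediate)
    return model.answer(sample.question, images)
```

Under this framing, the paper's first finding corresponds to `generate_then_answer` scoring below `direct_inference` on most subtasks, with the reverse holding for spatial-intelligence, visual-illusion, and multi-round reasoning regimes.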