Jun 8, 2026 · arXiv:2606.09169

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

AI Summary

This paper introduces IMUG-Bench, a novel benchmark designed to evaluate unified multimodal models (UMMs) on multi-turn interleaved image-text dialogues, addressing the limitations of existing benchmarks that focus on single-turn interactions. The benchmark includes over 3,100 samples and 12,000 interaction turns, allowing for a comprehensive assessment of both understanding and generation capabilities while highlighting exposure bias in multi-turn contexts. Experiments reveal significant performance gaps and suggest that test-time scaling strategies can enhance generation accuracy and mitigate exposure bias, providing critical insights for future UMM development.

Key Contribution

IMUG-Bench shows that UMMs exhibit pronounced exposure bias on the generation side of multi-turn interactions, a performance gap that existing single-turn and static benchmarks overlook.

Abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
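Of the test-time scaling strategies named in the abstract, Best-of-N Sampling is the simplest to illustrate. The sketch below is a minimal, generic version of the idea: draw N candidate outputs and keep the one a scorer rates highest. It is not the authors' implementation. The abstract does not specify the scorer, the value of N, or the model interface, so `best_of_n`, `generate`, `score`, and `keyword_overlap_score` are all hypothetical names, and plain strings stand in for generated images or responses.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidates for a prompt and return the highest-scoring one.

    `generate` and `score` are placeholders (assumptions): in practice the
    first would call a model and the second a verifier or reward model.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Sample n independent candidates for the same prompt.
    candidates = [generate(prompt) for _ in range(n)]

    # Score every candidate against the prompt and keep the best.
    scored = [(score(prompt, candidate), candidate) for candidate in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score


def keyword_overlap_score(prompt: str, candidate: str) -> float:
    """Toy scorer (assumption): number of distinct prompt words in the candidate."""
    return float(len(set(prompt.split()) & set(candidate.split())))
```

In a multi-turn setting, selection of this kind would be applied at each generation turn, so that a better output enters the dialogue history for later turns. Whether the paper applies it per turn is not stated in the abstract.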

Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
