NUS | Mar 5, 2026 | arXiv:2603.05075

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanling Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Cong Zhou, Weijie Zheng, Yu-liang Yan, Shengqiong Wu, Wei Ji, Lei Cui, Furu Wei, Hao Fei, Mong Li Lee, Wynne Hsu

AI Summary

The paper introduces UniM, a new benchmark dataset designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to handle arbitrarily combined and interleaved multimodal inputs and generate outputs in any interleaved multimedia form. UniM comprises 31K instances across 30 domains and 7 modalities (text, image, audio, video, document, code, and 3D), requiring intertwined reasoning and generation. The authors also provide an evaluation suite and a baseline model (UniMA) to assess and demonstrate the challenges of unified any-to-any multimodal intelligence.

Key Contribution

Forget unimodal tasks—UniM throws down the gauntlet for truly unified multimodal AI, demanding models juggle any combination of text, image, audio, video, code, documents, and 3D inputs and outputs in a single, interleaved stream.

Abstract

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
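To make "interleaved" concrete, the following is a minimal sketch of how one any-to-any instance might be represented and how a response's modality ordering could be compared against a reference. Everything here is an illustrative assumption: the `Segment` and `Instance` classes, their field names, the example domain and file names, and the `structure_matches` check are hypothetical. They are not the UniM data schema, and the check is not the UniM Evaluation Suite's Response Structure Integrity metric. Only the seven modality names come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# The seven modalities named in the abstract.
MODALITIES = {"text", "image", "audio", "video", "document", "code", "3d"}


@dataclass
class Segment:
    """One piece of an interleaved sequence (hypothetical structure)."""
    modality: str
    content: str  # inline text, or a placeholder path for non-text media

    def __post_init__(self) -> None:
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")


@dataclass
class Instance:
    """A hypothetical any-to-any instance: interleaved inputs and outputs."""
    domain: str
    inputs: List[Segment]
    outputs: List[Segment]


def modality_sequence(segments: List[Segment]) -> List[str]:
    """Return the ordered list of modalities in a sequence of segments."""
    return [segment.modality for segment in segments]


def structure_matches(predicted: List[Segment], reference: List[Segment]) -> bool:
    """Toy check: same modalities, in the same order, as the reference."""
    return modality_sequence(predicted) == modality_sequence(reference)


# Hypothetical example: text and image in, text/audio/text out.
example = Instance(
    domain="education",  # placeholder, not taken from the benchmark
    inputs=[
        Segment("text", "Explain what this diagram shows."),
        Segment("image", "diagram.png"),
    ],
    outputs=[
        Segment("text", "The diagram shows a water cycle."),
        Segment("audio", "narration.wav"),
        Segment("text", "The narration above summarizes each stage."),
    ],
)
```

Under these assumptions, a response that drops the audio segment would fail `structure_matches` even if its text were correct, which is the kind of structural error a purely text-based metric would miss.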

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
