Apr 27, 2026, arXiv:2604.25072

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Weixing Wang, Liudvikas Zekas, Anton Hackl, Constantin Alexander Auga, Parisa Shahabinejad, Jona Otholt, Antonio Rueda-Toicen, Gerard de Melo

AI Summary

The paper introduces XTC-Bench, a scene-graph-grounded evaluation framework, and CCTA, a fine-grained metric, to measure cross-task visual semantic consistency in unified multimodal models (uMMs). It reveals that high generation or understanding performance does not guarantee strong cross-task alignment, indicating a lack of coherent unified representations. The study finds that consistency is more dependent on the coupling of learning objectives across modalities than on architectural unification itself.

Key Contribution

Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.

Abstract

Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks for a given visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source unified models and one commercial unified model reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows that consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.
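
The abstract does not give the CCTA formula, so the following is only a minimal Python sketch of the general idea: decompose a scene graph into atomic facts (objects, attributes, relations), then compare per-fact scores from the generation side and the understanding side. The scene-graph layout, the names `Fact`, `scene_graph_to_facts`, and `toy_agreement`, and the scoring rule (mean of one minus the absolute score difference) are all illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One atomic fact derived from a scene graph (hypothetical structure)."""
    kind: str  # "object", "attribute", or "relation"
    text: str


def scene_graph_to_facts(graph):
    """Flatten a toy scene graph into object, attribute, and relation facts."""
    facts = []
    for obj in graph["objects"]:
        facts.append(Fact("object", obj["name"]))
        for attr in obj.get("attributes", []):
            facts.append(Fact("attribute", f"{attr} {obj['name']}"))
    for subj, pred, target in graph.get("relations", []):
        facts.append(Fact("relation", f"{subj} {pred} {target}"))
    return facts


def toy_agreement(gen_scores, und_scores):
    """Assumed stand-in for a cross-task agreement score (NOT the paper's CCTA).

    Both arguments map Fact -> score in [0, 1]. Only facts scored on both
    sides are compared; the result is the mean of 1 - |g - u|.
    """
    shared = gen_scores.keys() & und_scores.keys()
    if not shared:
        raise ValueError("no matched facts to compare")
    return sum(1.0 - abs(gen_scores[f] - und_scores[f]) for f in shared) / len(shared)


# Toy scene graph: a brown dog chasing a red ball.
graph = {
    "objects": [
        {"name": "dog", "attributes": ["brown"]},
        {"name": "ball", "attributes": ["red"]},
    ],
    "relations": [("dog", "chasing", "ball")],
}
facts = scene_graph_to_facts(graph)

# Hypothetical per-fact scores: generation renders every fact, while
# understanding is unsure about the relation.
gen = {f: 1.0 for f in facts}
und = {f: (0.5 if f.kind == "relation" else 1.0) for f in facts}
score = toy_agreement(gen, und)
```

Under this assumed rule, two sides that are both wrong about a fact still count as agreeing on it, which is one simple way to separate internal consistency from standalone task accuracy; the paper's actual metric may handle this differently.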

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
