Ping An Technology (Shenzhen) Co. · USTC · May 25, 2026 · arXiv:2605.25328

DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Shangfei Wang, Jianzong Wang

AI Summary

The paper investigates interference in unified multimodal models (UMMs) that arises from the conflicting inductive biases of understanding and generation tasks. The authors find that the generation branch prefers high-fidelity representations, while the understanding branch favors semantically discriminative embeddings, and that training both in one backbone leads to mutual impairment. To address this, they propose DIVA, a post-training framework that factorizes visual representations into shared and unique components, enabling beneficial transfer between branches while using mutual information estimation to preserve task-specific information. The paper reports improvements in both visual understanding (+7.82%) and generation (+8.46%).

Key Contribution

Unified multimodal models suffer from interference between their understanding and generation branches; this work shows how to turn that representation divergence into performance gains on both tasks.

Abstract

Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge arising from the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into internal synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables beneficial transfer between the understanding and generation branches while using mutual information estimation to protect the unique information from cross-flow interference. Despite its generality, our method consistently achieves improvements in visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.
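The abstract does not say how the factorization or the mutual information estimate is implemented, so the following is only an illustrative sketch, not the DIVA method. It assumes a fixed split of each feature vector into a "shared" prefix and a "unique" suffix, and it swaps in the standard InfoNCE lower bound with a cosine-similarity critic as the mutual information estimator. The names `factorize` and `infonce_lower_bound`, the temperature, and the synthetic data are all hypothetical.

```python
import numpy as np


def factorize(features, shared_dim):
    """Split features along the last axis into (shared, unique) parts.

    Assumption: a fixed slice stands in for whatever learned
    factorization the paper actually uses.
    """
    return features[:, :shared_dim], features[:, shared_dim:]


def infonce_lower_bound(x, y, temperature=0.1):
    """InfoNCE lower bound on I(X; Y) in nats.

    Row i of x is paired with row i of y; all other rows act as
    negatives. The value can never exceed log(batch size).
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = (x @ y.T) / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = x.shape[0]
    return float(np.log(batch) + np.mean(np.diag(log_probs)))


# Synthetic stand-ins for the two information flows (hypothetical data):
# both flows see the same latent z in their first 8 dimensions and
# independent noise in the remaining 8.
rng = np.random.default_rng(0)
n, d = 256, 8
z = rng.normal(size=(n, d))
und = np.concatenate([z + 0.1 * rng.normal(size=(n, d)),
                      rng.normal(size=(n, d))], axis=1)
gen = np.concatenate([z + 0.1 * rng.normal(size=(n, d)),
                      rng.normal(size=(n, d))], axis=1)

shared_u, unique_u = factorize(und, d)
shared_g, unique_g = factorize(gen, d)

mi_shared = infonce_lower_bound(shared_u, shared_g)
mi_unique = infonce_lower_bound(unique_u, unique_g)
print(f"shared: {mi_shared:.3f} nats, unique: {mi_unique:.3f} nats")
```

In this toy setting the estimate for the shared parts comes out higher than for the unique parts, which is the kind of signal a training objective could use to encourage transfer through shared components while keeping unique components free of cross-flow information.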

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
