The paper introduces Symbiotic-MoE, a unified pre-training framework for large multimodal models (LMMs) that mitigates the catastrophic forgetting in understanding tasks that arises when image generation is added. It addresses routing collapse in MoE architectures with Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups and a shared group that carries cross-modal synergy. A Progressive Training Strategy with differential learning rates and early-stage gradient shielding then turns generative signals into useful feedback for understanding, with reported gains on MMLU and OCRBench.
LMMs can learn to generate images *and* improve their understanding abilities, without catastrophic forgetting, by carefully disentangling and sharing experts within an MoE architecture.
Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformer architecture with zero parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To train this design effectively, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
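
The two mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the expert layout, the top-k routing rule, the learning-rate multipliers, and the shield/ramp schedule are all hypothetical choices made only to show the general idea of (a) restricting each task's tokens to its own experts plus the shared ones and (b) treating expert groups differently during training, with generation gradients held back from the shared experts early on.

```python
import math

# Hypothetical expert layout (illustrative only, not taken from the paper).
EXPERT_GROUPS = {
    "understanding": [0, 1],  # task-specific experts for understanding tokens
    "generation": [2, 3],     # task-specific experts for image-generation tokens
    "shared": [4, 5],         # shared experts reachable from both tasks
}

# Assumed differential learning rates: pre-trained understanding experts
# move slowly, newly trained generation experts move at the base rate.
BASE_LR = 1e-4
LR_MULTIPLIER = {"understanding": 0.1, "generation": 1.0, "shared": 0.5}


def allowed_experts(task):
    """Experts a token of this task may use: its own group plus the shared group."""
    return sorted(EXPERT_GROUPS[task] + EXPERT_GROUPS["shared"])


def route(logits, task, top_k=2):
    """Restrict routing to the allowed experts, then softmax over the top-k of them."""
    allowed = allowed_experts(task)
    chosen = sorted(allowed, key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)  # subtract the max for numerical stability
    weights = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}


def learning_rate(group):
    """Per-group learning rate under the assumed multipliers."""
    return BASE_LR * LR_MULTIPLIER[group]


def generative_grad_scale(group, step, shield_steps=1000, ramp_steps=1000):
    """Scale applied to generation-loss gradients reaching an expert group.

    Assumed schedule: generation experts always receive the full gradient,
    understanding experts receive none (generation tokens never route there),
    and shared experts are shielded for `shield_steps` steps before the scale
    ramps linearly to 1 over the next `ramp_steps` steps.
    """
    if group == "generation":
        return 1.0
    if group == "understanding":
        return 0.0
    if step < shield_steps:
        return 0.0
    return min(1.0, (step - shield_steps) / ramp_steps)
```

In this sketch, a call such as `route(router_logits, "understanding")` can never select experts 2 or 3, however large their logits are, which is one simple way to keep generation-heavy experts from dominating utilization for understanding tokens. The shared experts are the only path by which generation gradients reach parameters that understanding tokens also use, and `generative_grad_scale` opens that path only after the early training phase.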