Northwestern | Apr 16, 2026 | arXiv:2604.14520

Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

AI Summary

The paper identifies a performance paradox in Omni-MLLMs: unimodal baselines outperform joint multimodal inference because static fusion topologies suffer from two structural pathologies, positional bias and alignment traps. To address this, the authors introduce Chain of Modality (CoM), an agentic framework that dynamically orchestrates input topologies and bifurcates cognitive execution into "Direct-Decide" and "Reason-Decide" pathways. By adaptively switching among parallel, sequential, and interleaved pathways, CoM generalizes robustly across benchmarks in both training-free and data-efficient settings.

Key Contribution

Omni-MLLMs often underperform unimodal models due to flawed fusion architectures, but a new "Chain of Modality" approach dynamically orchestrates input modalities to fix this.

Abstract

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.
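To make the three input topologies named in the abstract concrete, the following is a minimal sketch of how modality chunks might be arranged before being passed to a model. It is an illustration under stated assumptions, not the authors' implementation: the `Segment` type and the `sequential`, `interleaved`, and `parallel` functions are hypothetical names, and the reading of "parallel" as keeping each stream separate is an assumption, since the abstract does not define it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical unit of input: one chunk from one modality stream."""
    modality: str  # e.g. "audio" or "video"
    index: int     # position of the chunk within its own stream


def sequential(streams: List[List[Segment]]) -> List[Segment]:
    """Concatenate whole streams one after another.

    The order of `streams` decides which modality comes first, which is
    where a positional bias could arise.
    """
    return [seg for stream in streams for seg in stream]


def interleaved(streams: List[List[Segment]]) -> List[Segment]:
    """Alternate chunks across streams by chunk index (round-robin)."""
    out: List[Segment] = []
    longest = max((len(s) for s in streams), default=0)
    for i in range(longest):
        for stream in streams:
            if i < len(stream):
                out.append(stream[i])
    return out


def parallel(streams: List[List[Segment]]) -> List[List[Segment]]:
    """Assumed reading of 'parallel': keep each stream as its own input."""
    return [list(s) for s in streams]
```

With two audio chunks and three video chunks, `sequential` would place all audio before all video, while `interleaved` would alternate them and append the leftover video chunk at the end. How CoM chooses among such layouts for a given task is not specified in the abstract and is not modeled here.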

Architecture Design (Transformers, SSMs, MoE) | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
