School of Traffic & Transportation Engineering | Apr 9, 2026 | arXiv:2604.08541

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Haolei Xu, Haiwen Hong, Rui Zhou, Yang Zhang, Longtao Huang, Hui Xue, Yongliang Shen, Yueting Zhuang

AI Summary

This paper identifies a "Seeing but Not Thinking" phenomenon in multimodal MoE models, where visual inputs lead to reasoning failures despite accurate perception. The authors hypothesize that routing mechanisms inadequately activate task-relevant reasoning experts when processing visual inputs, leading to performance degradation. They validate this by intervening on the routing mechanism to enhance domain expert activation, achieving consistent improvements across multiple models and benchmarks.

Key Contribution

Multimodal models can "see" the image but still fail at reasoning because the visual input distracts the routing mechanism from activating the right experts.

Abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
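
The routing ideas in the abstract can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how routing divergence is measured or how the intervention changes the router, so the Jensen-Shannon divergence, the constant logit bonus, the expert indices, and all numbers below are hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Convert router logits into a probability distribution over experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(logits, k):
    """Return indices of the k largest router logits (ties go to the lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def boost_domain_experts(logits, domain_experts, bonus):
    """Assumed intervention: add a constant bonus to the logits of chosen experts."""
    return [x + bonus if i in domain_experts else x for i, x in enumerate(logits)]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up router logits for one token at one middle layer with four experts.
text_logits = [0.1, 2.0, 1.5, 0.2]   # text input favors experts 1 and 2
image_logits = [2.2, 0.3, 1.4, 1.9]  # image input drifts toward experts 0 and 3
domain_experts = {1, 2}              # hypothetical "domain experts"

# Assumed proxy for routing divergence between the two modalities.
divergence = js_divergence(softmax(text_logits), softmax(image_logits))

# Expert selection for the image input before and after the assumed intervention.
before = top_k_experts(image_logits, k=2)
boosted = boost_domain_experts(image_logits, domain_experts, bonus=2.0)
after = top_k_experts(boosted, k=2)

print(f"routing divergence (JS): {divergence:.4f}")
print("top-2 experts before boost:", before)
print("top-2 experts after boost:", after)
```

In this toy setting the image input initially selects neither hypothetical domain expert, and the bonus shifts the top-2 selection back to them. A real intervention would operate on the router of an actual MoE layer and would need the expert identification step that the paper describes.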

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
