CMU ML | HKU | May 27, 2026 | arXiv:2605.28306

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Guanzhi Deng, Kuan Wu, Shing Yin Wong, Sichun Luo

AI Summary

This paper introduces Routing-Aligned MoE Fine-Tuning (RA-MoE), a novel three-stage framework to improve multilingual downstream task performance in MoE models by explicitly aligning routing behavior with English task-expert activation patterns. RA-MoE categorizes task examples based on correctness in English and the target language, identifies task-relevant experts, and uses a routing alignment loss to encourage target-language routing to follow English routing patterns for examples correct only in English. Experiments across three MoE models, three tasks, and six languages show that RA-MoE outperforms standard SFT and other routing-based baselines, with the proportion of examples correct only in English predicting the benefit of alignment.
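The categorization step can be sketched in a few lines. This is a minimal illustration, assuming each parallel example comes with two boolean correctness flags (English, target language). The labels follow the abstract's cc/ci/ic/ii naming, read here as English first and target language second, so `ci` means correct only in English. The function names and the sample data are hypothetical, not taken from the paper.

```python
from collections import Counter


def categorize(english_correct: bool, target_correct: bool) -> str:
    """Return the four-way label: first letter is English, second is the target language."""
    return ("c" if english_correct else "i") + ("c" if target_correct else "i")


def ci_proportion(pairs):
    """Fraction of parallel examples that are correct only in English (label 'ci')."""
    labels = Counter(categorize(en, tgt) for en, tgt in pairs)
    total = sum(labels.values())
    return labels["ci"] / total if total else 0.0


# Hypothetical correctness flags (English, target language) for five parallel examples.
pairs = [(True, True), (True, False), (True, False), (False, True), (False, False)]
print(ci_proportion(pairs))  # 0.4
```

The `ci` share computed this way is the quantity the summary describes as predicting how much a task-language pair benefits from alignment.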

Key Contribution

Explicitly aligning MoE routing behavior during fine-tuning improves performance on multilingual tasks, most of all when the model answers correctly in English but fails on the same example in the target language.

Abstract

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.
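The abstract does not give the form of the routing alignment loss, so the following is only a sketch under stated assumptions: the loss is taken to be a KL divergence between the English and target-language routing distributions, restricted to the task-relevant experts of one middle layer, and added to the SFT loss only for ci examples. The function names, the weight `lam`, and the choice of KL are all hypothetical and may differ from the paper's actual formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def routing_alignment_loss(en_router_logits, tgt_router_logits, task_experts):
    """KL(p_en || p_tgt) over the task-relevant experts (assumed loss form).

    Taking a softmax over only the task-expert logits is equivalent to
    renormalising the full routing distribution over those experts.
    """
    p_en = softmax([en_router_logits[i] for i in task_experts])
    p_tgt = softmax([tgt_router_logits[i] for i in task_experts])
    return sum(p * math.log(p / q) for p, q in zip(p_en, p_tgt))


def total_loss(sft_loss, align_loss, label, lam=0.1):
    """Add the alignment term only for ci examples; lam is a hypothetical weight."""
    return sft_loss + (lam * align_loss if label == "ci" else 0.0)


# Hypothetical router logits for a 4-expert layer; experts 0 and 1 are task-relevant.
en_logits = [1.0, 2.0, 0.5, -1.0]
tgt_logits = [2.0, 1.0, 0.5, -1.0]
align = routing_alignment_loss(en_logits, tgt_logits, task_experts=[0, 1])
loss = total_loss(sft_loss=1.5, align_loss=align, label="ci")
```

In a real training loop the English routing pattern would presumably be treated as a fixed target, with gradients flowing only through the target-language router; that detail is also an assumption here.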

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
