HUST, NTU | May 28, 2026 | arXiv:2605.29708

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

AI Summary

This paper investigates how safety alignment interacts with expert specialization in Mixture-of-Experts (MoE) LLMs and finds that routing patterns are primarily topic-driven, not safety-driven. The authors introduce RASET, a red-teaming framework that identifies safety-critical experts using a contrastive routing-sensitivity criterion and tunes only those experts, demonstrating that safety behavior can be altered with minimal impact on the model's routing. The study indicates that safety enforcement is localized in a small subset of experts, which constitutes a distinct MoE safety risk.

Key Contribution

Safety in MoE LLMs isn't about routing harmful requests to "refusal experts". Instead, it is surprisingly localized within specific experts, and it can be broken without significantly changing the model's overall routing behavior.

Abstract

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.
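The abstract does not spell out how routing is compared across prompt sets or how the contrastive criterion is computed, so the following is only a toy illustration and not RASET's actual method. It assumes standard top-k routing, synthetic router logits, and a hypothetical score defined as the per-expert difference in activation frequency between two prompt sets; all function names and the planted signal on expert 3 are invented for the example.

```python
import numpy as np


def topk_route(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring experts for each token (tokens x k)."""
    # argsort is ascending, so the last k columns hold the top-k experts.
    return np.argsort(router_logits, axis=-1)[:, -k:]


def activation_frequency(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Fraction of tokens that activate each expert under top-k routing."""
    n_tokens, n_experts = router_logits.shape
    chosen = topk_route(router_logits, k)
    counts = np.bincount(chosen.ravel(), minlength=n_experts)
    return counts / n_tokens


def contrastive_scores(logits_a: np.ndarray, logits_b: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical score: activation frequency on set A minus set B, per expert."""
    return activation_frequency(logits_a, k) - activation_frequency(logits_b, k)


def select_experts(scores: np.ndarray, n_select: int) -> np.ndarray:
    """Experts with the largest absolute contrastive score."""
    return np.argsort(-np.abs(scores))[:n_select]


# Synthetic router logits for a single MoE layer (assumption: 8 experts, top-2).
rng = np.random.default_rng(0)
n_experts, k = 8, 2
logits_set_b = rng.normal(size=(500, n_experts))
logits_set_a = rng.normal(size=(500, n_experts))
logits_set_a[:, 3] += 2.0  # planted signal: expert 3 fires more often on set A

scores = contrastive_scores(logits_set_a, logits_set_b, k)
selected = select_experts(scores, n_select=1)
print("contrastive scores:", np.round(scores, 3))
print("selected expert:", selected)
```

In a real model the logits would come from the router of each MoE layer rather than from a random generator, and the paper's routing-sensitivity criterion may differ substantially from this frequency difference.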

Architecture Design (Transformers, SSMs, MoE) | Red-Teaming & Adversarial Robustness | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
