This paper investigates how safety alignment interacts with expert specialization in Mixture-of-Experts (MoE) LLMs, and finds that routing patterns are primarily topic-driven rather than safety-driven. The authors introduce RASET, a red-teaming framework that identifies safety-critical experts using a contrastive routing-sensitivity criterion and tunes only those experts, demonstrating that safety behavior can be altered with minimal impact on the model's routing. The study shows that safety enforcement is localized in a small subset of experts, which exposes a distinct MoE safety risk.
Safety in MoE LLMs isn't about routing harmful requests to "refusal experts"; it's surprisingly localized within specific experts, and you can break it without significantly changing the model's overall routing behavior.
Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.
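The abstract does not spell out the contrastive routing-sensitivity criterion, so the following is only a minimal sketch of one plausible reading: compare how often each expert is activated on two prompt sets and rank experts by the size of the gap. The function names, the frequency-difference score, and the toy routing traces are all assumptions made for illustration, not the authors' method, and the tuning step is not covered.

```python
from typing import List, Sequence


def activation_frequency(routes: Sequence[Sequence[int]], num_experts: int) -> List[float]:
    """Fraction of tokens whose top-k routing selected each expert."""
    if not routes:
        raise ValueError("routes must contain at least one token")
    counts = [0] * num_experts
    for selected in routes:
        for expert in set(selected):  # count each expert at most once per token
            counts[expert] += 1
    return [count / len(routes) for count in counts]


def contrastive_scores(
    routes_contrast: Sequence[Sequence[int]],
    routes_reference: Sequence[Sequence[int]],
    num_experts: int,
) -> List[float]:
    """Per-expert activation-frequency gap between two prompt sets.

    Hypothetical score: the source does not define the actual criterion.
    """
    freq_contrast = activation_frequency(routes_contrast, num_experts)
    freq_reference = activation_frequency(routes_reference, num_experts)
    return [c - r for c, r in zip(freq_contrast, freq_reference)]


def top_experts(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k experts with the largest absolute score (ties: lower index first)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-abs(scores[i]), i))
    return ranked[:k]


# Toy routing traces: one inner list per token, holding its top-2 expert indices.
routes_contrast = [[0, 2], [0, 3], [0, 2], [1, 2]]
routes_reference = [[1, 3], [1, 2], [3, 1], [1, 2]]

scores = contrastive_scores(routes_contrast, routes_reference, num_experts=4)
selected = top_experts(scores, k=2)
print(scores, selected)
```

On this toy data, experts 0 and 1 show the largest gaps. In a real MoE model the traces would come from the router's top-k selections at a given layer, and a score like this would be computed per layer.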