This paper introduces RotMoLE, a Mixture-of-Experts (MoE) framework tailored for low-rank adapters, which adds a rotational gating mechanism to enhance expert specialization. Unlike conventional MoE gating, which only reweights each selected expert's output by a scalar, RotMoLE also applies a rotation to each selected expert's output. Experiments on multi-task and multilingual training show that RotMoLE improves performance by better exploiting the capacity of a limited set of expert candidates.
RotMoLE's rotational gating unlocks more representational power from low-rank MoE architectures, even when expert diversity is limited.
Large Language Models (LLMs) are commonly fine-tuned on domain-specific tasks before being deployed in vertical applications, but adapting them to complex scenarios that require diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), yielding the Mixture of Low-rank Experts (MoE-LoRA), which strengthens low-rank adapters for learning complicated knowledge. However, conventional MoE gating mechanisms typically apply only a scalar reweighting to the selected experts, which limits their capacity for representation and generalization. Motivated and enabled by the low-rank structure of MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts that features an additional rotation gate. Beyond simple scaling, RotMoLE applies a rotation to each selected expert, enabling better expert exploitation and specialization when learning from diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of RotMoLE.
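To make the contrast between scalar gating and rotational gating concrete, here is a minimal sketch in plain Python. The abstract does not say how the rotation is parameterized, so everything below is an assumption made for illustration, not the authors' implementation: the rotation acts in the rank-r space between the LoRA down-projection `A` and up-projection `B`, it is block-diagonal with one 2x2 rotation per pair of coordinates, and its angles come from a hypothetical per-expert linear map `angle_Ws`. The names `moe_lora_forward`, `rotate_pairs`, `gate_W`, and `angle_Ws` are likewise hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def rotate_pairs(h, angles):
    """Rotate each consecutive pair (h[2i], h[2i+1]) by angles[i].

    This is a block-diagonal rotation, so it preserves the norm of h.
    ASSUMPTION: this parameterization is illustrative only.
    """
    out = list(h)
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        a, b = h[2 * i], h[2 * i + 1]
        out[2 * i] = c * a - s * b
        out[2 * i + 1] = s * a + c * b
    return out

def moe_lora_forward(x, experts, gate_W, angle_Ws, top_k=2, use_rotation=True):
    """Top-k mixture of low-rank experts with an optional rotation gate.

    experts[i] holds "A" (r x d_in) and "B" (d_out x r).
    gate_W (num_experts x d_in) produces the routing logits.
    angle_Ws[i] ((r / 2) x d_in) is a hypothetical map to rotation angles.
    """
    logits = matvec(gate_W, x)
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])  # conventional scalar gate

    y = [0.0] * len(experts[0]["B"])
    for w, i in zip(weights, top):
        h = matvec(experts[i]["A"], x)           # project down to rank r
        if use_rotation:
            angles = matvec(angle_Ws[i], x)      # input-dependent angles
            h = rotate_pairs(h, angles)          # rotation gate (assumed form)
        out = matvec(experts[i]["B"], h)         # project back up
        y = [yi + w * oi for yi, oi in zip(y, out)]
    return y
```

With `use_rotation=False`, or with all angles equal to zero, the function reduces to ordinary scalar-weighted MoE-LoRA. In this sketch the rotation is cheap because it acts on an r-dimensional vector rather than on the full hidden dimension, which is one plausible reading of "enabled by the low-rank structure".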