A*STAR, RUC, Shanghai University of Engineering
May 30, 2026
arXiv:2606.00761

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

AI Summary

This paper introduces Confidence-Aware SwiGLU (κ-SwiGLU), a novel gated activation function for Mixture-of-Experts (MoE) models that dynamically adjusts gate sharpness based on token-level routing confidence. By parameterizing the gate sharpness coefficient as a learnable function of the router logit, κ-SwiGLU allows for a flexible balance between smooth and selective gating. Evaluations on the FineWeb-Edu dataset show that κ-SwiGLU enhances mean CORE performance across various MoE Transformer architectures while maintaining low additional computational costs and minimal parameter increases.

Key Contribution

Adjusting expert gate sharpness according to token-level routing confidence improves mean CORE performance in Mixture-of-Experts models while adding negligible parameters and only a small computational overhead.

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
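The abstract does not give the exact functional form, so the following is only a minimal NumPy sketch of the general idea: a SiLU gate whose sharpness coefficient κ depends on the router logit. The parameterization `kappa = softplus(a * router_logit + b)`, the scalars `a` and `b`, and the function names `sharp_silu` and `kappa_swiglu_expert` are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def softplus(z):
    # log(1 + exp(z)), keeps the sharpness coefficient positive.
    return np.logaddexp(0.0, z)

def sharp_silu(g, kappa):
    # SiLU with sharpness kappa: g * sigmoid(kappa * g).
    # kappa = 1 is ordinary SiLU; large kappa approaches ReLU (sharp, selective);
    # kappa -> 0 approaches the linear map g / 2 (smooth, broadly active).
    return g * sigmoid(kappa * g)

def kappa_swiglu_expert(x, router_logit, w_gate, w_up, w_down, a, b):
    # HYPOTHETICAL parameterization: kappa is a softplus of an affine function
    # of the per-token router logit; a and b stand in for learnable scalars.
    kappa = softplus(a * router_logit + b)[:, None]  # shape (tokens, 1)
    gate = sharp_silu(x @ w_gate, kappa)             # shape (tokens, d_ff)
    return (gate * (x @ w_up)) @ w_down              # shape (tokens, d_model)

# With a = 0 and b = log(e - 1), softplus(b) = 1, so this sketch reduces to
# a standard SwiGLU expert regardless of the router logit.
B_UNIT = np.log(np.e - 1.0)
```

Under these assumptions, a positive `a` would make the gate of an expert sharper for tokens routed to it with a high logit and smoother for tokens routed with a low logit, while the only added parameters are the two scalars, which is in the spirit of the "negligible parameters" claim above.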

Architecture Design (Transformers, SSMs, MoE)

Year: 2026

Venue: N/A
