Arcee, DatologyAI, Prime Intellect | Feb 19, 2026 | arXiv:2602.17004

Arcee Trinity Large Technical Report

Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Fern, Aria Harley, Conner Stewart, Colin Kealty, Maziyar Panahi, Simon Kirsten, Anushka Deshpande, Anneketh Vij, Arthur Bresnu, Pranav Veldurthi, Raghav Ravishankar, Hardik Bishnoi, DatologyAI Team, Arcee AI Team, Prime Intellect Team, Mark McQuade, Johannes Hagemann, Lucas Atkins

AI Summary

The paper introduces Arcee Trinity Large, a 400B parameter sparse Mixture-of-Experts model with 13B parameters activated per token, along with smaller variants Trinity Nano (6B total, 1B active) and Trinity Mini (26B total, 3B active). These models employ a modern architecture featuring interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for MoE. The authors also introduce Soft-clamped Momentum Expert Bias Updates (SMEBU) for improved MoE load balancing in Trinity Large, and train all models using the Muon optimizer, achieving stable training across 10-17 trillion tokens.
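The total and active parameter counts above imply how sparse each model is. The short Python sketch below works out the fraction of parameters activated per token; the counts are the rounded figures quoted in the summary, and the names `MODELS` and `active_fraction` are illustrative only, not taken from the report.

```python
# Rounded parameter counts in billions, as quoted in the summary:
# (total parameters, parameters activated per token).
MODELS = {
    "Trinity Nano": (6, 1),
    "Trinity Mini": (26, 3),
    "Trinity Large": (400, 13),
}


def active_fraction(total_b, active_b):
    """Return the share of parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

On these rounded figures, the activated share shrinks as the models grow: roughly one sixth for Trinity Nano, a little over a tenth for Trinity Mini, and about 3% for Trinity Large.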

Key Contribution

A new family of sparse Mixture-of-Experts models, Arcee Trinity, achieves stable training at scale thanks to a novel MoE load balancing strategy (SMEBU).

Abstract

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy called Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
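As a rough illustration of what sigmoid routing means in a Mixture-of-Experts layer, the sketch below scores each expert with an independent sigmoid (rather than a softmax across experts) and keeps the top-k experts for one token. This is a generic sketch, not the Trinity implementation: the function names, the example logits, the value of `k`, and the choice to renormalize the selected gates so they sum to 1 are all assumptions made for illustration. It also omits any load-balancing term such as SMEBU, whose details are not given in the abstract.

```python
import math


def sigmoid(x):
    """Logistic function mapping a router logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_route(logits, k):
    """Pick the top-k experts for one token from per-expert router logits.

    Each expert is scored independently with a sigmoid. The gates of the
    selected experts are renormalized to sum to 1 (an assumption of this
    sketch). Returns a list of (expert_index, gate_weight) pairs, ordered
    from highest to lowest score.
    """
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


# Hypothetical router logits for a token over four experts.
routes = sigmoid_route([2.0, -1.0, 0.5, 0.0], k=2)
for expert, gate in routes:
    print(f"expert {expert}: gate weight {gate:.3f}")
```

Because each score depends only on its own logit, raising one expert's logit does not directly lower the scores of the others, which is the main practical difference from softmax routing.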

Architecture Design (Transformers, SSMs, MoE) | Open-Source Models & Weights | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 61

Year: 2026

Venue: N/A
