Mar 16, 2026arXiv:2603.15389

When Does Sparsity Mitigate the Curse of Depth in LLMs

Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu

AI Summary

This paper investigates how sparsity, both implicit (from training) and explicit (from architecture), mitigates the curse of depth in LLMs by regulating variance propagation. Through controlled depth-scaling experiments and layer effectiveness interventions, the authors show that sparsity improves layer utilization by reducing output variance and promoting functional differentiation across layers. They distill these findings into a practical recipe for training depth-effective LLMs, achieving a 4.6% accuracy improvement on downstream tasks.

Key Contribution

Sparsity isn't just for efficiency in LLMs; it's a secret weapon against the "curse of depth," boosting layer utilization and downstream accuracy by taming variance.

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that, sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixtureof-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training deptheffective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.

Architecture Design (Transformers, SSMs, MoE)Inference & Quantization Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

When Does Sparsity Mitigate the Curse of Depth in LLMs

Related Papers