Mar 3, 2026 · arXiv:2603.03131

Joint Training Across Multiple Activation Sparsity Regimes

AI Summary

This paper investigates whether training a neural network across varying activation sparsity levels improves generalization. The authors introduce a training strategy that applies global top-k constraints to hidden activations and cycles a single model through multiple activation budgets via progressive compression and periodic reset. In single-run experiments on CIFAR-10 without data augmentation, using a WRN-28-4 backbone, two adaptive keep-ratio control strategies both outperform the dense baseline. These preliminary results suggest that joint training across multiple sparsity regimes may improve generalization.

Key Contribution

Training a single network to perform well under varying activation sparsity constraints may improve generalization: in preliminary single-run CIFAR-10 experiments, it outperformed standard dense training.

Abstract

Generalization in deep neural networks remains only partially understood. Inspired by the stronger generalization tendency of biological systems, we explore the hypothesis that robust internal representations should remain effective across both dense and sparse activation regimes. To test this idea, we introduce a simple training strategy that applies global top-k constraints to hidden activations and repeatedly cycles a single model through multiple activation budgets via progressive compression and periodic reset. Using CIFAR-10 without data augmentation and a WRN-28-4 backbone, we find in single-run experiments that two adaptive keep-ratio control strategies both outperform dense baseline training. These preliminary results suggest that joint training across multiple activation sparsity regimes may provide a simple and effective route to improved generalization.
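The two ingredients named in the abstract, a global top-k constraint on hidden activations and a keep ratio that is progressively compressed and then reset, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, ranking by absolute value, the geometric decay, and the fixed `floor`, `decay`, and `steps_per_level` values are all assumptions made for illustration; in particular, the paper's keep-ratio control is described as adaptive, whereas the schedule below is fixed.

```python
import heapq


def global_topk_mask(layers, keep_ratio):
    """Zero all but the largest-magnitude activations, ranked jointly across layers.

    `layers` is a list of flat lists of floats, one list per hidden layer.
    Ranking by absolute value is an assumption; the abstract only says
    "global top-k constraints".
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Pool every activation from every layer into one ranking.
    flat = [(abs(v), li, i)
            for li, layer in enumerate(layers)
            for i, v in enumerate(layer)]
    k = max(1, int(keep_ratio * len(flat)))
    keep = {(li, i) for _, li, i in heapq.nlargest(k, flat)}
    return [[v if (li, i) in keep else 0.0 for i, v in enumerate(layer)]
            for li, layer in enumerate(layers)]


def keep_ratio_schedule(num_steps, start=1.0, floor=0.1, decay=0.5,
                        steps_per_level=2):
    """Yield one keep ratio per step (hypothetical fixed schedule).

    The ratio is multiplied by `decay` every `steps_per_level` steps
    (progressive compression) and jumps back to `start` once it would
    fall below `floor` (periodic reset).
    """
    ratio = start
    for step in range(num_steps):
        yield ratio
        if (step + 1) % steps_per_level == 0:
            ratio *= decay
            if ratio < floor:
                ratio = start  # reset to the dense regime


if __name__ == "__main__":
    activations = [[0.1, -3.0, 0.5], [2.0, -0.2, 1.0]]
    for r in keep_ratio_schedule(9):
        print(r, global_topk_mask(activations, r))
```

Because the ranking is pooled over all layers, the number of surviving activations per layer is not fixed: a layer with larger magnitudes keeps more of its units than a per-layer top-k would allow. In a real training loop the mask would be applied inside the forward pass with a tensor library rather than to Python lists.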

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
