UC Santa Cruz | Apr 28, 2026 | arXiv:2604.25570

Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

Dewei Bai, Hongxiang Peng, Yunyun Zeng, Ziyu Zhang, Hong Qu, Yi Zhang

AI Summary

Vision SmolMamba is introduced as an energy-efficient spiking state-space architecture for vision that uses spike-driven dynamics with linear-time selective recurrence. A Spike-Guided Spatio-Temporal Token Pruner (SST-TP) estimates token importance using spike activation strength and first-spike latency to remove redundant tokens. Experiments on static and event-based benchmarks show that Vision SmolMamba achieves superior accuracy-efficiency trade-offs, reducing the estimated energy cost by at least 1.5x compared to spiking Transformer baselines.

Key Contribution

By pruning tokens according to spike activation strength and first-spike latency, Vision SmolMamba reduces the estimated energy cost by at least 1.5x relative to prior spiking Transformer baselines and a Spiking Mamba variant, while maintaining competitive or improved accuracy.

Abstract

Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
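The abstract does not give the SST-TP scoring rule, so the following is only a minimal NumPy sketch of the general idea: rank tokens by a mix of firing rate (activation strength) and how early they first spike (first-spike latency), then keep the top fraction. The function name `spike_guided_prune`, the parameters `keep_ratio` and `alpha`, the `(T, N, D)` layout, and the linear combination of the two cues are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def spike_guided_prune(spikes, keep_ratio=0.5, alpha=0.5):
    """Hypothetical token pruner (illustrative, not the paper's SST-TP).

    spikes: binary array of shape (T, N, D): time steps, tokens, channels.
    Returns the sorted indices of kept tokens and the pruned spike tensor.
    """
    T, N, _ = spikes.shape

    # Activation strength: mean firing rate of each token over time and channels.
    rate = spikes.mean(axis=(0, 2))                      # shape (N,)
    max_rate = rate.max()
    strength = rate / max_rate if max_rate > 0 else rate

    # First-spike latency: first time step at which any channel of the token
    # fires; tokens that never fire are assigned the maximum latency T.
    fired = spikes.any(axis=2)                           # shape (T, N)
    latency = np.where(fired.any(axis=0), fired.argmax(axis=0), T)
    earliness = 1.0 - latency / T                        # earlier spike -> higher

    # Assumed scoring rule: convex combination of the two cues.
    score = alpha * strength + (1.0 - alpha) * earliness

    # Keep the k highest-scoring tokens, preserving their original order.
    k = max(1, int(round(keep_ratio * N)))
    keep = np.sort(np.argsort(-score, kind="stable")[:k])
    return keep, spikes[:, keep, :]
```

Under these assumptions, a token that fires strongly from the first time step outranks one that fires weakly and late, and a silent token scores zero. Because the downstream recurrence is linear in sequence length, halving the kept tokens roughly halves the token-mixing work at that stage.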

Architecture Design (Transformers, SSMs, MoE) | Computer Vision Inference & Quantization

Year: 2026

Venue: N/A
