Microsoft Research | BIT | Mar 5, 2026 | arXiv:2603.05232

SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Shaohan Huang, Songcheng Xu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yinxue Zou, Yi Zou, Furu Wei

AI Summary

SlideSparse is introduced as a system to enable Sparse Tensor Core acceleration on commodity GPUs for (2N-2):2N structured sparsity patterns, which offer a better accuracy-sparsity trade-off than the hardware-supported 2:4 sparsity. It achieves this by using Sliding Window Decomposition to reconstruct (2N-2):2N weight blocks into overlapping 2:4-compliant windows and Activation Lifting to fuse activation rearrangement into per-token quantization. Integrated into vLLM, SlideSparse achieves speedups approaching the theoretical upper bound on compute-bound workloads across various GPUs, precisions, and model families.
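The two steps can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes width-4 windows that slide with stride 2 over a 2N-wide weight block (which yields the N-1 windows the paper reports) and a greedy left-to-right assignment of nonzeros. The names `slide_decompose`, `lift_activations`, and `windowed_dot` are hypothetical. The real system fuses the activation step into per-token quantization, which this sketch leaves out.

```python
def slide_decompose(w_block):
    """Split one 2N-wide block with at most 2N-2 nonzeros into N-1 windows.

    Window j covers source positions 2j .. 2j+3 (assumed stride-2 layout).
    Each nonzero is placed in exactly one window, at most two per window,
    so every window satisfies the 2:4 pattern.
    """
    two_n = len(w_block)
    if two_n < 4 or two_n % 2:
        raise ValueError("block width must be an even number >= 4")
    if sum(1 for v in w_block if v != 0) > two_n - 2:
        raise ValueError("block is not (2N-2):2N sparse")

    assigned = [False] * two_n
    windows = []
    for j in range(two_n // 2 - 1):
        start = 2 * j
        window = [0.0] * 4
        taken = 0
        # Greedy: take the leftmost unassigned nonzeros first, so values
        # about to leave the sliding range are never stranded.
        for p in range(start, start + 4):
            if w_block[p] != 0 and not assigned[p] and taken < 2:
                window[p - start] = w_block[p]
                assigned[p] = True
                taken += 1
        windows.append(window)
    return windows


def lift_activations(x_block):
    """Duplicate activations so that slice j lines up with weight window j."""
    return [x_block[2 * j: 2 * j + 4] for j in range(len(x_block) // 2 - 1)]


def windowed_dot(windows, lifted):
    """Dot product over the window layout (what the sparse GEMM would compute)."""
    return sum(w * x for win, act in zip(windows, lifted) for w, x in zip(win, act))


# 6:8 example: eight weights, two of them zero.
w = [0.5, -1.0, 2.0, 0.0, 3.0, 0.0, -0.25, 1.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

windows = slide_decompose(w)
lifted = lift_activations(x)
dense = sum(a * b for a, b in zip(w, x))

# The rearranged product matches the dense one exactly.
assert windowed_dot(windows, lifted) == dense
```

In this layout the 8-wide block becomes three 4-wide windows (12 lifted elements). Because each weight value is copied rather than altered, the result is identical to the dense product, which is consistent with the lossless claim.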

Key Contribution

SlideSparse brings Sparse Tensor Core acceleration to (2N-2):2N sparsity on commodity GPUs, reaching up to 1.33x faster LLM inference at 6:8 sparsity while avoiding the accuracy loss caused by NVIDIA's strict 2:4 (50%) pruning.

Abstract

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.
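One way to see where the bound $N/(N-1)$ comes from is a simple cost count. The sketch below is an idealized reading, not a derivation taken from the paper: it assumes each of the $N-1$ windows is 4 elements wide, that the sparse path runs at exactly 2x dense throughput, and that the activation rearrangement is free.

```latex
\begin{align*}
\text{dense cost per block}  &\propto 2N \\
\text{lifted width}          &= 4(N-1) \\
\text{sparse cost per block} &\propto \frac{4(N-1)}{2} = 2(N-1) \\
\text{speedup}               &= \frac{2N}{2(N-1)} = \frac{N}{N-1}
\end{align*}
% N = 2 (2:4): speedup 2
% N = 4 (6:8): speedup 4/3, approximately 1.33
```

Under these assumptions the bound recovers the familiar 2x for 2:4 at $N=2$ and shrinks toward 1 as $N$ grows, reflecting the trade between milder pruning and smaller speedup.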

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
