Eastern Institute of Technology, LMU, Ningbo Institute of Digital Twin, Ningbo Key Laboratory of Spatial, NJU, PolyU

Jun 8, 2026

arXiv:2606.09080

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen

AI Summary

This paper introduces a GEMM-centric taxonomy that organizes pruning methods for large language model (LLM) inference acceleration by the logical M, N, and K dimensions of general matrix multiplication (GEMM). It is motivated by the observation that realized speedups depend heavily on hardware and kernel implementations. Using a unified benchmarking framework, the authors compare pruning methods under consistent implementations and characterize the acceleration-quality Pareto frontier. They find that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the Pareto-optimal method shifts from static depth pruning to dynamic depth pruning and then to static width pruning as quality loss increases.
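To make the notion of an acceleration-quality Pareto frontier concrete, here is a minimal sketch of how such a frontier can be extracted from measured (quality loss, speedup) pairs. The function name `pareto_frontier`, the tuple layout, and the sample values are all hypothetical placeholders for illustration; they are not taken from the paper or its code.

```python
from typing import List, Tuple

# (label, quality_loss_percent, speedup); lower loss and higher speedup are better.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes."""
    # Sort by increasing quality loss; break ties by decreasing speedup.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier: List[Point] = []
    best_speedup = float("-inf")
    for point in ordered:
        # A point survives only if it is faster than every lower-loss point.
        if point[2] > best_speedup:
            frontier.append(point)
            best_speedup = point[2]
    return frontier


# Placeholder measurements (NOT results from the paper).
samples = [
    ("config_a", 1.0, 1.2),
    ("config_b", 2.0, 1.1),  # dominated by config_a
    ("config_c", 3.0, 1.5),
    ("config_d", 3.0, 1.4),  # dominated by config_c
]
frontier = pareto_frontier(samples)
print([name for name, _, _ in frontier])
```

A method sits on the frontier only if no other method is both faster and at least as accurate, which is the sense in which the summary speaks of the optimal method changing as the quality-loss budget grows.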

Key Contribution

Static depth pruning remains the strongest Pareto-optimal baseline for LLM acceleration and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios.

Abstract

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%–4%), to dynamic depth at moderate loss (5%–16%), and finally to static width pruning at higher loss levels (17%–26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim
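The gap between FLOP reduction and realized speedup can be illustrated with a textbook roofline estimate for a single GEMM of shape (M x K) times (K x N). This is only a rough sketch under stated assumptions, not the authors' benchmarking framework: the helper names, the 2-byte element size, and the hardware figures (`PEAK_FLOPS`, `MEM_BW`) are hypothetical, and the abstract does not say which pruning family maps to which dimension, so the example simply shrinks raw dimensions.

```python
def gemm_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """FLOPs and bytes moved for C[m, n] = A[m, k] @ B[k, n]."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops, bytes_moved


def roofline_time(m, n, k, peak_flops, mem_bw, bytes_per_elem=2):
    """Lower-bound runtime: the slower of the compute and memory limits."""
    flops, bytes_moved = gemm_cost(m, n, k, bytes_per_elem)
    return max(flops / peak_flops, bytes_moved / mem_bw)


def estimated_speedup(base, pruned, peak_flops, mem_bw):
    """Ratio of estimated runtimes for two (m, n, k) shapes."""
    return roofline_time(*base, peak_flops, mem_bw) / roofline_time(*pruned, peak_flops, mem_bw)


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS, MEM_BW = 100e12, 1e12

# Decode-like shape (few rows, memory-bound): both variants halve the FLOPs.
decode_halve_m = estimated_speedup((8, 4096, 4096), (4, 4096, 4096), PEAK_FLOPS, MEM_BW)
decode_halve_k = estimated_speedup((8, 4096, 4096), (8, 4096, 2048), PEAK_FLOPS, MEM_BW)

# Prefill-like shape (many rows, compute-bound): halving M halves the runtime.
prefill_halve_m = estimated_speedup((4096, 4096, 4096), (2048, 4096, 4096), PEAK_FLOPS, MEM_BW)

print(f"decode, halve M:  {decode_halve_m:.3f}x")
print(f"decode, halve K:  {decode_halve_k:.3f}x")
print(f"prefill, halve M: {prefill_halve_m:.3f}x")
```

Under these assumptions, halving M in the memory-bound shape leaves the estimated runtime almost unchanged because the weight matrix still has to be read in full, while halving K shrinks the weight matrix itself and roughly halves the runtime. Real kernels add further effects such as tiling, launch overhead, and irregular sparsity that this estimate ignores.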

Eval Frameworks & Benchmarks, Inference & Quantization, Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
