Mar 12, 2026 · arXiv:2603.12222

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

AI Summary

HiAP, a novel hierarchical auto-pruning framework, is introduced to address the computational demands of Vision Transformers by simultaneously pruning at both macro (attention heads/FFN blocks) and micro (intra-head dimensions/FFN neurons) granularities. The method employs stochastic Gumbel-Sigmoid gates and a loss function incorporating structural feasibility and analytical FLOPs to enable end-to-end training of sparse sub-networks. Experiments on ImageNet demonstrate HiAP's ability to discover efficient architectures and achieve a competitive accuracy-efficiency trade-off compared to multi-stage pruning methods, while simplifying the deployment pipeline.
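To make the stochastic gating idea concrete, here is a minimal sketch of a single Gumbel-Sigmoid gate in plain Python. It is not taken from the paper: the function name `gumbel_sigmoid`, the temperature parameter `tau`, and the `hard` thresholding flag are assumptions chosen for illustration, and a real implementation would operate on tensors and pass gradients through the soft value.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, tau=1.0, rng=random, hard=False):
    """Sample a relaxed binary gate from a learnable logit (illustrative only).

    logit: unconstrained gate parameter; larger means "more likely kept".
    tau:   temperature (assumed name); smaller values push samples toward 0 or 1.
    hard:  if True, threshold the soft sample at 0.5 to get a 0/1 decision.
    """
    # The difference of two independent Gumbel(0, 1) samples follows a
    # Logistic(0, 1) distribution, which can be drawn as log(u) - log(1 - u).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)
    soft = _sigmoid((logit + noise) / tau)
    if hard:
        return 1.0 if soft > 0.5 else 0.0
    return soft


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [gumbel_sigmoid(2.0, tau=0.5, rng=rng) for _ in range(5)]
    print([round(s, 3) for s in samples])
```

With `hard=True`, the gate is kept with probability `sigmoid(logit)` regardless of `tau`, so the logit can be read directly as a keep probability; the temperature only controls how sharply the soft samples concentrate near 0 and 1.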

Key Contribution

Forget complex, multi-stage pruning pipelines: HiAP slashes Vision Transformer size with a single, end-to-end training pass that optimizes sparsity at multiple granularities.

Abstract

Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
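The interaction between macro-gates and micro-gates in an analytical compute term can be sketched as follows. This is a hedged illustration, not the paper's loss: the function `expected_block_macs`, its cost model (multiply-accumulates for the Q/K/V/output projections, the two attention matrix products, and the two FFN layers), and the assumption that macro and micro gates are independent, so that a unit's keep probability is their product, are all introduced here for clarity.

```python
def expected_block_macs(tokens, embed_dim, head_macro, head_micro,
                        ffn_macro, ffn_micro):
    """Expected multiply-accumulates of one gated transformer block.

    Hypothetical cost model for illustration only.
    head_macro: keep probability per attention head (macro-gates).
    head_micro: per head, keep probability of each intra-head dimension.
    ffn_macro:  keep probability of the whole FFN block (macro-gate).
    ffn_micro:  keep probability of each FFN hidden neuron.
    """
    n, d = tokens, embed_dim

    # A head dimension survives only if both its head and the dimension
    # itself are kept (independence assumed, so probabilities multiply).
    kept_dims = sum(m * sum(micro) for m, micro in zip(head_macro, head_micro))

    # Cost per surviving dimension: Q, K, V projections (3*n*d),
    # QK^T and attention-times-V (2*n*n), and the output projection (n*d).
    attn_macs = kept_dims * (3 * n * d + 2 * n * n + n * d)

    # A hidden neuron costs one column of the first FFN layer and one row
    # of the second, i.e. 2*n*d multiply-accumulates.
    kept_neurons = ffn_macro * sum(ffn_micro)
    ffn_macs = kept_neurons * 2 * n * d

    return attn_macs + ffn_macs


if __name__ == "__main__":
    # Shapes assumed to match a standard DeiT-Small block at 224x224 input:
    # 197 tokens, width 384, 6 heads of 64 dims, FFN hidden size 1536.
    dense = expected_block_macs(197, 384, [1.0] * 6, [[1.0] * 64] * 6,
                                1.0, [1.0] * 1536)
    pruned = expected_block_macs(197, 384, [1.0] * 5 + [0.0], [[0.5] * 64] * 6,
                                 1.0, [0.5] * 1536)
    print(f"dense: {dense:.3e} MACs, pruned: {pruned:.3e} MACs")
```

Because the expected count is a smooth function of the keep probabilities, a term of this kind can be added to the training loss and pushed toward a compute budget by gradient descent; closing a macro-gate removes a whole head or FFN block at once, while micro-gates trim width within the units that remain.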

Architecture Design (Transformers, SSMs, MoE) · Computer Vision Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
