Feb 17, 2026 · arXiv:2602.15720

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

AI Summary

The paper introduces ToaSt, a decoupled pruning framework for Vision Transformers that addresses limitations of existing structured pruning and token compression methods. ToaSt applies coupled head-wise structured pruning to Multi-Head Self-Attention modules and introduces Token Channel Selection (TCS) for Feed-Forward Networks to improve compression ratios. Experiments across nine ViT models demonstrate that ToaSt achieves superior accuracy-efficiency trade-offs, with a notable result of 88.52% accuracy (+1.64%) and 39.4% FLOPs reduction on ViT-MAE-Huge.

Key Contribution

ToaSt improves ViT efficiency while avoiding the optimization difficulties of earlier approaches: its decoupled framework applies head-wise structured pruning to the attention modules and Token Channel Selection to the feed-forward networks, yielding better accuracy-FLOPs trade-offs than existing methods.

Abstract

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
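
The abstract's statement that Feed-Forward Networks account for over 60% of FLOPs can be sanity-checked with a rough per-block count. The sketch below is an illustration, not the authors' accounting: it assumes a standard ViT encoder block with an MLP expansion ratio of 4, counts only the matrix multiplications (layer norms, softmax, and biases are ignored), and uses a ViT-Base/DeiT-Base style configuration (197 tokens, embedding dimension 768) chosen here as an example. The function names are hypothetical.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> dict:
    """Approximate multiply-accumulate counts for one standard ViT encoder block.

    Only matrix multiplications are counted; this is a simplifying assumption.
    """
    n, d = num_tokens, dim
    projections = 4 * n * d * d          # Q, K, V and output projections
    attention = 2 * n * n * d            # QK^T plus attention-weighted sum of V
    ffn = 2 * n * d * (mlp_ratio * d)    # two linear layers: d -> r*d -> d
    return {"mhsa": projections + attention, "ffn": ffn}


def ffn_share(num_tokens: int, dim: int, mlp_ratio: int = 4) -> float:
    """Fraction of a block's matmul cost spent in the feed-forward network."""
    macs = block_macs(num_tokens, dim, mlp_ratio)
    return macs["ffn"] / (macs["mhsa"] + macs["ffn"])


if __name__ == "__main__":
    # Assumed example configuration: 196 patches + 1 class token, dim 768.
    print(f"FFN share of block compute: {ffn_share(197, 768):.1%}")
```

Under these assumptions the FFN share comes out at roughly 64%, consistent with the "over 60%" figure. The share shrinks as the token count grows, because the attention term scales quadratically in the number of tokens.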

Architecture Design (Transformers, SSMs, MoE) · Computer Vision Inference & Quantization

Year: 2026

Venue: N/A
