Sofia University "St. Kliment Ohridski", University of Modena and Reggio Emilia
May 21, 2026
arXiv:2605.22132

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool

AI Summary

This paper introduces a method to accelerate Vision Transformers (ViTs) by replacing selected attention heads with a depthwise convolution-based layer, exploiting the convolution-like behavior that some attention heads exhibit. The authors propose strategies for identifying replaceable heads and a fine-tuning procedure that recovers downstream performance. Experiments on image classification and segmentation show a 17-20% inference speedup with minimal performance degradation.
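As a rough illustration of what "identifying replaceable heads" could involve, the sketch below scores a head by how much of its attention mass stays inside a small window around each query token. This is a hypothetical heuristic chosen for illustration, not the selection strategy from the paper, which this page does not describe. The function name `locality_score`, the `radius` parameter, and the assumption that the attention matrix covers only patch tokens on an H x W grid (no class token) are all assumptions.

```python
def locality_score(attn, H, W, radius=1):
    """Mean fraction of attention mass inside a (2*radius+1)^2 window.

    attn: N x N nested list, N = H * W, row q holds query q's weights
          over all patch tokens (hypothetical input layout).
    Returns a value in (0, 1]; values near 1 suggest a local,
    convolution-like head.
    """
    n_tokens = H * W
    total = 0.0
    for q in range(n_tokens):
        qi, qj = divmod(q, W)          # grid position of the query token
        row = attn[q]
        mass = sum(row)                # normalise in case rows do not sum to 1
        local = 0.0
        for ki in range(max(0, qi - radius), min(H, qi + radius + 1)):
            for kj in range(max(0, qj - radius), min(W, qj + radius + 1)):
                local += row[ki * W + kj]
        total += local / mass
    return total / n_tokens


# Two toy heads on a 4 x 4 patch grid.
H, W = 4, 4
N = H * W
identity_head = [[1.0 if k == q else 0.0 for k in range(N)] for q in range(N)]
uniform_head = [[1.0 / N for _ in range(N)] for _ in range(N)]

score_local = locality_score(identity_head, H, W)
score_global = locality_score(uniform_head, H, W)
print(score_local, score_global)
```

A head that attends only to its own token scores 1.0, while a head that spreads attention uniformly scores much lower, so a threshold on such a score would be one simple way to shortlist candidates. Any real criterion would need to be validated against downstream accuracy.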

Key Contribution

Get up to 20% faster ViT inference by hot-swapping certain attention heads for depthwise convolutions – without tanking accuracy.

Abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
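The "convolution-like behavior" mentioned above can be made concrete with a small check: if a head's mixing weights depend only on the relative offset between query and key patches, applying that N x N mixing matrix to the values gives the same result as a depthwise convolution with a small kernel. The following is a minimal sketch in plain Python, not the paper's layer. It assumes an unnormalised 3 x 3 kernel, zero padding, no class token, and the same kernel for every channel; the helper names are hypothetical. Real softmax attention renormalises each row, so the match would only be approximate near image borders.

```python
import random


def depthwise_conv2d(x, kernels):
    """Zero-padded 'same' depthwise cross-correlation.

    x: [C][H][W] feature map; kernels: [C][k][k], one filter per channel.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    r = len(kernels[0]) // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kernels[c][di + r][dj + r] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def offset_mixing_matrix(kernel, H, W):
    """N x N token-mixing matrix whose weights depend only on relative offset."""
    r = len(kernel) // 2
    N = H * W
    A = [[0.0] * N for _ in range(N)]
    for i in range(H):
        for j in range(W):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        A[i * W + j][ii * W + jj] = kernel[di + r][dj + r]
    return A


def apply_token_mixing(A, x):
    """Apply one mixing matrix to every channel, as a head does with its values."""
    H, W = len(x[0]), len(x[0][0])
    out = []
    for channel in x:
        flat = [v for row in channel for v in row]
        mixed = [sum(a * v for a, v in zip(a_row, flat)) for a_row in A]
        out.append([mixed[i * W:(i + 1) * W] for i in range(H)])
    return out


random.seed(0)
C, H, W = 2, 4, 5
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
kernel = [[random.random() for _ in range(3)] for _ in range(3)]

A = offset_mixing_matrix(kernel, H, W)
head_out = apply_token_mixing(A, x)                # costs O(N^2) per channel
conv_out = depthwise_conv2d(x, [kernel] * C)       # costs O(N * k^2) per channel

max_diff = max(
    abs(head_out[c][i][j] - conv_out[c][i][j])
    for c in range(C) for i in range(H) for j in range(W)
)
print("max abs difference:", max_diff)
```

The two paths agree up to floating-point error, while the convolution touches only k^2 neighbours per token instead of all N tokens, which is where a speedup from replacing such heads would come from.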

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Inference & Quantization

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
