NVIDIA, PI, UPenn
May 26, 2026
arXiv:2605.26636

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Dongyun Zou, Zhuoyang Zhang, Wenkun He, Qinhe Peng, Hanrong Ye, Hongxu Yin

AI Summary

JetViT, a hybrid-attention Vision Transformer, matches state-of-the-art accuracy while substantially improving inference efficiency on high-resolution images. It is built with Post-Training Attention Search, a framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by replacing redundant full-attention blocks with linear-attention or window-attention blocks. Evaluated on DINOv3 and DepthAnythingV2, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency on an NVIDIA H100 GPU without sacrificing accuracy.
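The throughput and latency figures can be put on a common scale with simple arithmetic. The sketch below assumes a fixed batch size, so that throughput is inversely proportional to latency; the helper names are illustrative and not from the paper. Both reported numbers are "up to" values and may come from different models or settings, so they need not agree exactly.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction (0 <= reduction < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """Fractional latency reduction implied by a speedup factor (speedup >= 1)."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


# 44.81% lower latency corresponds to roughly a 1.81x speedup.
print(round(speedup_from_latency_reduction(0.4481), 2))
# 1.79x higher throughput corresponds to roughly 44.13% less time per image.
print(round(latency_reduction_from_speedup(1.79) * 100, 2))
```

Under that assumption the two headline numbers describe gains of similar size: about 1.8x in both cases.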

Key Contribution

Get up to 1.79x higher ViT inference throughput on high-resolution images without sacrificing accuracy by surgically replacing full-attention blocks with cheaper alternatives *after* pre-training.
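To see why full attention dominates at high resolution, a rough cost model helps. The sketch below counts multiply-accumulates for the token-mixing step only, using the standard asymptotic costs: quadratic in token count for full attention, linear for window and linear attention. The patch size (16), embedding width (768), head dimension (64), and window of 14x14 tokens are illustrative assumptions, not values from the paper.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image (assumed non-overlapping patches)."""
    return (height // patch) * (width // patch)


def attention_macs(n_tokens: int, dim: int, kind: str,
                   head_dim: int = 64, window_tokens: int = 196) -> int:
    """Approximate multiply-accumulates for token mixing in one block.

    Q/K/V/output projections and the MLP are ignored; they cost the same
    regardless of the attention type.
    """
    if kind == "full":
        # QK^T and (scores)V: each N x N x dim
        return 2 * n_tokens * n_tokens * dim
    if kind == "window":
        # Each token attends only to the tokens in its window
        return 2 * n_tokens * min(window_tokens, n_tokens) * dim
    if kind == "linear":
        # K^T V per head, then Q (K^T V): each N x dim x head_dim
        return 2 * n_tokens * dim * head_dim
    raise ValueError(f"unknown attention kind: {kind}")


n = num_tokens(1024, 1024)  # 4096 tokens at the assumed patch size
for kind in ("full", "window", "linear"):
    print(kind, attention_macs(n, 768, kind))
```

With these assumed settings, token mixing in a full-attention block costs about 21x more than in a window block and 64x more than in a linear block. End-to-end gains are much smaller because the projections and MLPs are unchanged and some full-attention blocks are kept.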

Abstract

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
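The abstract does not specify the search procedure, so the following is only a generic greedy sketch of steps (2) and (3): choosing a block type per layer while keeping the full-attention blocks whose replacement would cost too much accuracy. Everything here is hypothetical, including the relative block costs, the toy accuracy proxy, and the function names. It does not cover step (1) or weight inheritance.

```python
from typing import Callable, List, Sequence

# Made-up relative costs per block type (assumption, not measured).
BLOCK_COST = {"full": 1.0, "window": 0.4, "linear": 0.3}


def make_toy_evaluator(sensitivity: Sequence[float]) -> Callable[[List[str]], float]:
    """Toy accuracy proxy: each replaced block subtracts a per-block penalty."""
    penalty = {"full": 0.0, "window": 0.5, "linear": 1.0}

    def evaluate(layout: List[str]) -> float:
        return 1.0 - sum(s * penalty[k] for s, k in zip(sensitivity, layout))

    return evaluate


def greedy_hybrid_search(num_blocks: int,
                         evaluate: Callable[[List[str]], float],
                         max_drop: float) -> List[str]:
    """Greedily replace full-attention blocks while accuracy stays within budget.

    Each round tries every single-block replacement, keeps those within
    max_drop of the all-full baseline, and applies the one with the largest
    cost saving (ties broken by higher score).
    """
    layout = ["full"] * num_blocks
    baseline = evaluate(layout)
    while True:
        best = None  # (cost saving, score, candidate layout)
        for i, kind in enumerate(layout):
            if kind != "full":
                continue
            for cand in ("linear", "window"):
                trial = layout[:i] + [cand] + layout[i + 1:]
                score = evaluate(trial)
                if baseline - score > max_drop:
                    continue  # replacement is too costly in accuracy
                key = (BLOCK_COST["full"] - BLOCK_COST[cand], score)
                if best is None or key > best[:2]:
                    best = (key[0], key[1], trial)
        if best is None:
            return layout  # remaining full blocks are treated as critical
        layout = best[2]


evaluate = make_toy_evaluator([0.001, 0.006, 0.05, 0.001])
print(greedy_hybrid_search(4, evaluate, max_drop=0.0055))
```

In this toy run the third block is too sensitive to replace and stays full attention, the second tolerates only window attention, and the other two become linear attention. A real search would replace the toy evaluator with measured accuracy on a validation set after inheriting weights from the base model.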

Architecture Design (Transformers, SSMs, MoE)
Computer Vision
Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
