University of Science and Technology of China, Hefei, China
liangg@lamda.nju.edu.cn, liuxinyao@mail.ustc.edu.cn, wujx2001@nju.edu.cn
Corresponding author.

Abstract

Vision Transformers (ViTs) are essential in computer vision but are computationally intensive. Model quantization, particularly to low bit-widths such as 4-bit, aims to alleviate this cost, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and a lack of open-source codebases. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization, and the necessity of preserving the model’s original optimization “basin” to maintain generalization. Consequently, GPLQ employs a sequential “activation-first, weights-later” strategy. Stage 1 keeps weights in FP32 while quantizing activations, training with a feature mimicking loss for only 1 epoch so that the model stays in the same “basin” and thereby preserves generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers the memory footprint to levels even below FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models in terms of both accuracy on ImageNet and generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
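To make the two-stage schedule concrete, the following toy sketch applies it to a single linear layer in plain Python. It is not the GPLQ implementation and makes several assumptions: a min-max uniform quantizer, a plain mean-squared-error loss standing in for the feature mimicking loss, round-to-nearest standing in for the PTQ method of Stage 2, and hypothetical names throughout (`fake_quantize`, `linear`, `mimicking_loss`). No training is performed; the sketch only shows which tensors are quantized at each stage and which quantity Stage 1 would minimize.

```python
# Toy sketch of an "activation-first, weights-later" schedule.
# Assumptions (not from the paper): min-max uniform quantizer, MSE as the
# mimicking loss, round-to-nearest as the Stage 2 weight quantizer.

def fake_quantize(values, bits=4):
    """Quantize to 2**bits uniform levels over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant input needs no quantization grid.
        return list(values)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]

def linear(weights, x):
    """Compute y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mimicking_loss(student, teacher):
    """Mean squared error between student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)

weights = [[0.5, -1.2, 0.3, 0.9], [0.8, 0.1, -0.7, -0.2]]
x = [0.9, -0.4, 1.7, 0.2]

# Full-precision reference features (the "teacher").
teacher = linear(weights, x)

# Stage 1: quantize activations only; weights stay in full precision.
# In training, this loss would be minimized to keep features near the teacher.
x_q = fake_quantize(x, bits=4)
loss_stage1 = mimicking_loss(linear(weights, x_q), teacher)

# Stage 2: quantize weights afterwards, here per row with round-to-nearest.
weights_q = [fake_quantize(row, bits=4) for row in weights]
loss_stage2 = mimicking_loss(linear(weights_q, x_q), teacher)

print(f"stage 1 feature error: {loss_stage1:.6f}")
print(f"stage 2 feature error: {loss_stage2:.6f}")
```

With 4 bits, each quantized tensor takes at most 16 distinct values, and each value moves by at most half a quantization step. Keeping the weights in full precision during Stage 1 means the only source of feature error at that point is the activation quantizer, which is the quantity the mimicking loss is meant to control.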
1 Introduction

Vision Transformer (ViT) dosovitskiy2020image ; vaswani2017attention has emerged as the mainstream backbone network in computer vision, but it demands substantial computational and memory resources. Model quantization is one of the key techniques for addressing this challenge: it reduces the numerical precision of model parameters and/or activation values lang2024comprehensive ; li2022q . However, existing quantization methods still face challenges, especially in low-bit (e.g., 4-bit) quantization.

Mainstream methods include Post-Training Quantization (PTQ) liu2021post and Quantization-Aware Training (QAT) esser2019learned . PTQ is fast and consumes few resources, but often leads to a large accuracy drop under 4-bit quantization li2023repq . QAT, on the other hand, simulates quantization operations during training and achieves higher accuracy than PTQ, sometimes even exceeding that of floating-point models. Nevertheless, in this paper we will show that existing QAT methods have inherent limitations:

• High Computational Costs. QAT requires lengthy fine-tuning of the entire model. The training time and GPU memory required by QAT often far exceed those for training the FP32 model lang2024comprehensive . This makes QAT cumbersome and very slow to deploy in real-world applications.
• Limited Generalization Ability. QAT methods often boast higher accuracy than their FP32 counterparts. However, in this paper we will show that such models generalize worse than FP32 or PTQ-quantized models on downstream tasks. That is, they likely do not generalize beyond ImageNet deng2009imagenet , the dataset on which they were trained.
• Training Instability and Complexity. QAT is prone to training instability huang2023quantization , and complex Knowledge Distillation (KD) techniques li2022q ; huang2023quantization severely increase the memory footprint. Some methods also rely on external, extremely powerful teacher models, which are not available in practical scenarios. In short, existing QAT methods are not practical.
• Classification Only and Code Missing. Open-source code for QAT is rare, and where it exists it covers only classification. This further makes QAT impractical for real-world applications.

To this end, we propose GPLQ (General, Practical, and Lightning Quantization). The core objective of GPLQ is to provide a quantization solution that is far more training-efficient than traditional QAT, superior to PTQ in accuracy and generalization, easy to use, and highly practical. Figure 1 illustrates three core advantages of GPLQ.

• General. GPLQ exhibits excellent average accuracy on multiple downstream tasks: close to or even surpassing FP32 models, and significantly outperforming existing QAT methods.
• Practical. GPLQ has a very small training memory footprint (far lower than that of existing QAT methods), which avoids out-of-memory (OOM) issues in many applications and enables quantization of larger models. GPLQ’s design also allows it to be conveniently applied to other tasks such as object detection.
• Lightning. GPLQ is blazingly fast: hundreds of times faster than existing QAT methods.

Figure 1: Core advantages of our GPLQ: Generality, Practicality, and Lightning efficiency.

GPLQ is based on our empirical findings. First, activations are far more important than weights in low-bit quantization. Second, quantization should not move the model out of its original optimization “basin” (i.e., it should avoid jumping out of the current local minimum), so that the generalization ability is preserved. Based on these findings, GPLQ adopts a sequential quantization paradigm. First, activations are quantized with weights kept at FP32.
To maintain generalization, we draw inspiration from TCS zhou2025all and employ a PCA-based feature mimicking loss to guide the quantized model’s feature outputs to approximate those of the original FP32 model (i.e., to stay in the same basin). Second, after activations are quantized, existing efficient PTQ methods are used to quantize the weights. This “activation-first, weights-later” strategy not only drastically reduces training time from the days typical of QAT to 1-2 hours, with a memory footprint even lower than that of FP32 training, but also allows a 4-bit model to achieve accuracy and generalization nearly identical to those of the original FP32 model.

The main contributions are:

1. Insights. We reveal that activation quantization is the main bottleneck in QAT, and that staying in the original optimization basin is crucial for generalization.
2. GPLQ. We propose “activation-first” sequential quantization: first optimize activations, then quantize weights via PTQ.
3. Code. GPLQ provides an easy-to-use quantization tool supporting classification, detection and other downstream tasks. We will open-source GPLQ upon paper acceptance.

2 Related Work

Model quantization aims to enhance model efficiency by reducing the numerical precision of weights and activations in neural networks papa2024survey .

Post-Training Quantization (PTQ). PTQ operates without retraining, requires only a small calibration set, and is very fast. Various techniques have been proposed: AIQViT jiang2025aiqvit , GPTQ frantar2022gptq , PTQ