This paper reproduces and extends the analysis of activation outliers in transformer quantization, showing that global W8A8 quantization drops the validation accuracy of BERT-base on QNLI from 89.66% (FP32) to 54.33%. Statistical analysis reveals heavy-tailed activation distributions with high kurtosis, with approximately 55% of activation energy concentrated in the top 1% of channels and the effect intensifying in deeper layers. The study evaluates mixed-precision and per-embedding-group (PEG) quantization as mitigation strategies, finding that channel-aware precision allocation is more effective than percentile-based calibration, and also assesses deployment tradeoffs on an RTX 3050 GPU.
Naive quantization of transformers can destroy accuracy, not because of random noise, but because a few dominant channels carry most of the signal. Recovering accuracy therefore requires channel-aware quantization strategies.
Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides an empirical reproduction and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy-tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers, and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed-precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per-embedding-group (PEG) quantization is strongly sensitive to grouping structure: accuracy improves from 66.12% with three groups to 86.18% with four groups. In contrast, percentile-based calibration fails to recover accuracy (about 50.54%) even at percentile thresholds between 99.0 and 99.99, indicating that large activation channels encode structured signal rather than rare noise. Deployment profiling on an RTX 3050 GPU shows minimal differences in latency and memory usage across methods (median latency about 58-59 ms; VRAM usage about 484-486 MB), highlighting the importance of hardware-aware evaluation. Overall, the results show that PTQ failure in transformers is primarily driven by structured channel dominance amplified through residual connections. Effective mitigation therefore requires channel-aware precision allocation rather than scalar clipping alone.
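The quantities named in the abstract (kurtosis, top-1% channel energy share, and the contrast between a single global scale, grouped scales, and percentile clipping) can be illustrated with a small NumPy sketch. This is not the paper's code and does not reproduce its numbers: the activations are synthetic Gaussian data, the outlier channel indices and the 40x scaling are arbitrary assumptions, and `quantize_per_group` is only a simplified PEG-style scheme (channels sorted by dynamic range and split into equal groups, each with its own scale).

```python
import numpy as np

def pearson_kurtosis(x):
    """Pearson kurtosis of all values in x (a normal distribution gives 3)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / variance ** 2)

def top_channel_energy_share(acts, fraction=0.01):
    """Share of total squared activation carried by the top `fraction` of channels.

    acts has shape (tokens, channels).
    """
    energy = np.sum(np.asarray(acts, dtype=np.float64) ** 2, axis=0)
    k = max(1, int(round(fraction * energy.size)))
    return float(np.sort(energy)[-k:].sum() / energy.sum())

def fake_quant_int8(x, scale):
    """Symmetric int8 quantize-dequantize; values beyond 127 * scale are clipped."""
    return np.clip(np.round(x / scale), -127, 127) * scale

def quantize_per_tensor(acts):
    """One scale for the whole tensor, set by the largest absolute value."""
    return fake_quant_int8(acts, np.abs(acts).max() / 127.0)

def quantize_percentile(acts, percentile=99.9):
    """One scale for the whole tensor, set by a percentile of |acts|."""
    return fake_quant_int8(acts, np.percentile(np.abs(acts), percentile) / 127.0)

def quantize_per_group(acts, num_groups):
    """Simplified PEG-style scheme: sort channels by range, one scale per group."""
    order = np.argsort(np.abs(acts).max(axis=0))
    out = np.empty_like(acts)
    for group in np.array_split(order, num_groups):
        scale = np.abs(acts[:, group]).max() / 127.0
        out[:, group] = fake_quant_int8(acts[:, group], scale)
    return out

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

# Synthetic stand-in for one layer's activations: 512 tokens, 768 channels,
# with a handful of channels scaled up to mimic dominant outlier channels.
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 768))
outliers = [10, 200, 381, 500, 640, 767]  # hypothetical outlier channel indices
acts[:, outliers] *= 40.0                 # hypothetical outlier magnitude

print("kurtosis:", pearson_kurtosis(acts))
print("top-1% energy share:", top_channel_energy_share(acts))
print("per-tensor MSE:", mse(acts, quantize_per_tensor(acts)))
print("4-group MSE:", mse(acts, quantize_per_group(acts, 4)))
print("99.9th-percentile MSE:", mse(acts, quantize_percentile(acts, 99.9)))
```

In this toy setting, the global scale is dictated by the few large channels, so every other channel is quantized with a coarse step. Grouping by range gives the low-range groups a much finer step, while percentile clipping saturates the large channels themselves. How closely this mirrors the behavior of real BERT activations is exactly what the paper measures, so the sketch should be read only as an illustration of the mechanisms.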