Beihang | Apr 5, 2026 | arXiv:2604.03957

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

Yifu Ding, Xianglong Liu, Shenghao Jin, Jinyang Guo, Jiwen Lu

AI Summary

This paper introduces BWTA, a Binary Weights & Ternary Activations quantization scheme for Transformers that reduces the accuracy degradation of ultra-low-bit quantization by projecting tiny values to zero. The authors propose Smooth Multi-Stage Quantization for stable training and develop a custom CUDA kernel with instruction-level parallel bit-packing for efficient inference. BWTA approaches full-precision performance on BERT and achieves comparable results on LLMs, while providing kernel-level and end-to-end speedups on NVIDIA GPUs.
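To make the binary-weight and ternary-activation idea concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the mean-absolute-value scale, the fixed `threshold` parameter, and both function names are assumptions chosen for illustration. It only shows the general shape of the scheme, in which weights keep their sign and activations close to zero are projected to exactly zero.

```python
def binarize_weights(weights):
    """Map each weight to -1 or +1 and return a per-tensor scale.

    Assumption: the scale is the mean absolute weight, a common choice
    in binary networks; the paper may use a different factor.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def ternarize_activations(activations, threshold):
    """Map each activation to -1, 0, or +1.

    Assumption: values with magnitude <= threshold are projected to zero;
    the threshold is a hypothetical hyperparameter for this sketch.
    """
    result = []
    for a in activations:
        if abs(a) <= threshold:
            result.append(0)   # tiny value: project to zero
        elif a > 0:
            result.append(1)
        else:
            result.append(-1)
    return result


signs, scale = binarize_weights([0.5, -1.5, 1.0])
ternary = ternarize_activations([0.01, -0.02, 0.9, -1.2], threshold=0.05)
```

The zero level is what separates ternary from binary activations: a small activation no longer has to be rounded to a full-magnitude -1 or +1.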

Key Contribution

Binarizing weights and ternarizing activations in Transformers delivers a 16-24x kernel-level speedup over FP16 on NVIDIA GPUs while keeping accuracy close to full precision (an average 3.5% drop on GLUE for BERT), which makes ultra-low-bit quantization practical for inference.

Abstract

Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
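The abstract mentions bit-packed binary/ternary MatMul. The sketch below illustrates, under stated assumptions, why such products can be computed with bitwise operations and population counts instead of multiplications. It is not the authors' CUDA kernel: the encoding (one bit per weight sign, plus a sign bit and a nonzero-mask bit per activation) and all function names are hypothetical, and Python integers stand in for machine words.

```python
def pack_binary(signs):
    """Pack a list of -1/+1 weights into an int: bit i is 1 when signs[i] == +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def pack_ternary(values):
    """Pack -1/0/+1 activations into (sign_bits, mask_bits).

    mask_bits has bit i set when values[i] != 0;
    sign_bits has bit i set when values[i] == +1.
    """
    sign_bits = 0
    mask_bits = 0
    for i, v in enumerate(values):
        if v != 0:
            mask_bits |= 1 << i
            if v > 0:
                sign_bits |= 1 << i
    return sign_bits, mask_bits


def popcount(x):
    """Count set bits in a non-negative int."""
    return bin(x).count("1")


def packed_dot(w_bits, a_sign, a_mask):
    """Dot product of binary weights and ternary activations.

    Among nonzero activations, each position contributes +1 when the signs
    agree and -1 when they differ, so dot = nonzero - 2 * disagree.
    """
    disagree = popcount((w_bits ^ a_sign) & a_mask)
    return popcount(a_mask) - 2 * disagree


weights = [1, -1, 1, -1, 1]
activations = [1, 1, 0, -1, -1]
a_sign, a_mask = pack_ternary(activations)
result = packed_dot(pack_binary(weights), a_sign, a_mask)
reference = sum(w * a for w, a in zip(weights, activations))
```

On a GPU the same arithmetic would operate on fixed-width words, with the population count provided by an intrinsic such as CUDA's `__popc`; how the paper's kernel actually lays out and packs the bits is not specified here.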

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
