LBLLM introduces a three-stage quantization strategy for binarizing LLMs to W(1+1)A4: PTQ initialization, layer-wise distillation of weights and quantization parameters, and learnable activation quantization factors. This decoupled approach reduces interference between weight and activation quantization, improving training stability and accuracy. Trained on a small dataset with a single GPU, LBLLM outperforms existing binarization methods on language modeling, commonsense QA, and language understanding tasks, demonstrating the feasibility of extreme low-bit quantization.
LLMs can be aggressively quantized to W(1+1)A4 without significant performance degradation using a surprisingly simple three-stage distillation approach.
Deploying large language models (LLMs) in resource-constrained environments is hindered by their heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy: (1) initialize a high-quality quantized model via PTQ; (2) optimize the binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. Trained with only 0.016B tokens on a single GPU, LBLLM surpasses existing state-of-the-art binarization methods under the W2A4 quantization setting across language modeling, commonsense QA, and language understanding tasks. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without the extra high-precision channels or rotation matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited settings.
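To make the decoupled design concrete, below is a minimal PyTorch sketch of the two quantizer pieces the abstract names. It assumes the group-wise bitmap in the W(1+1) representation selects between two per-group scales, and that the stage-3 activation quantizer is an LSQ-style learnable scale trained with a straight-through estimator; both are illustrative assumptions rather than the paper's exact parameterization, and the names `quantize_w1p1` and `LearnableActQuant4bit` are hypothetical.

```python
import torch


def quantize_w1p1(weight: torch.Tensor, group_size: int = 128):
    """Sketch of W(1+1) weight quantization: a 1-bit sign per weight plus a
    1-bit group-wise bitmap selecting one of two per-group scales.
    Illustrative assumption only; the paper's exact scheme may differ."""
    out_f, in_f = weight.shape
    assert in_f % group_size == 0
    w = weight.reshape(out_f, in_f // group_size, group_size)
    sign = torch.sign(w)                      # 1-bit component
    sign[sign == 0] = 1.0
    mag = w.abs()
    med = mag.median(dim=-1, keepdim=True).values
    hi_mask = mag > med                       # the extra 1-bit bitmap
    # Two candidate scales per group: mean magnitude of each partition.
    hi = (mag * hi_mask).sum(-1, keepdim=True) / hi_mask.sum(-1, keepdim=True).clamp(min=1)
    lo = (mag * ~hi_mask).sum(-1, keepdim=True) / (~hi_mask).sum(-1, keepdim=True).clamp(min=1)
    scale = torch.where(hi_mask, hi, lo)
    w_hat = (sign * scale).reshape(out_f, in_f)  # dequantized view
    return sign, hi_mask, (lo, hi), w_hat


class LearnableActQuant4bit(torch.nn.Module):
    """Stage-3 sketch: fake-quantize activations to 4 bits with a learnable
    scale, using a straight-through estimator so the scale gets gradients."""

    def __init__(self, init_scale: float = 1.0, bits: int = 4):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(float(init_scale)))
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale.abs().clamp(min=1e-8)  # keep the scale positive
        v = x / s
        v = (v.round() - v).detach() + v      # straight-through rounding
        return v.clamp(self.qmin, self.qmax) * s


if __name__ == "__main__":
    w = torch.randn(8, 256)
    _, _, _, w_hat = quantize_w1p1(w)
    act_q = LearnableActQuant4bit(init_scale=w_hat.abs().mean().item())
    x = torch.randn(4, 256)
    y = act_q(x) @ w_hat.t()                  # W(1+1)A4 forward sketch
    print(y.shape)
```

Under the staging the abstract describes, only the weight-side tensors above would be optimized against layer-wise distillation targets in stage 2, with activations left in full precision; the activation scales are then learned in stage 3 with the weights frozen, which is where the decoupling comes from.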