This paper introduces LC-QAT, a 2-bit weight-only quantization-aware training (QAT) framework that uses linear-constrained vector quantization to optimize large language models. LC-QAT represents quantized weights through a learned affine mapping over discrete vectors, which makes training fully differentiable end to end, with no explicit codebook lookup in the training forward pass, and yields a high-quality post-training quantization (PTQ) initialization. Experiments show that LC-QAT outperforms state-of-the-art QAT methods while using only 0.1% to 10% of the training data, which makes it practical for deploying extremely low-bit models.
LC-QAT outperforms state-of-the-art QAT methods at 2-bit precision while using only 0.1% to 10% of the training data.
Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. In contrast, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors. This representation yields a high-quality post-training quantization (PTQ) initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. The strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1% to 10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.
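The abstract does not spell out the parameterization, so the following is only a minimal NumPy sketch of the general idea of "an affine mapping over discrete vectors", not the authors' implementation. It assumes weights are grouped into vectors of 4, that the discrete vectors are fixed 2-bit integer codes obtained by uniform scalar quantization, and that the affine map is fitted by least squares. The function names (`scalar_quantize_2bit`, `fit_affine_map`, `dequantize`) and all of these choices are hypothetical.

```python
import numpy as np

def scalar_quantize_2bit(w):
    """Assumed baseline: uniform 2-bit quantization, returns integer codes in {0,1,2,3}."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 3.0          # 4 levels span [lo, hi]
    codes = np.clip(np.round((w - lo) / scale), 0, 3)
    return codes, scale, lo

def fit_affine_map(codes, w):
    """Least-squares fit of w ≈ codes @ A + b (hypothetical initialization step)."""
    n = codes.shape[0]
    X = np.hstack([codes, np.ones((n, 1))])   # append a bias column
    sol, *_ = np.linalg.lstsq(X, w, rcond=None)
    return sol[:-1], sol[-1]                   # A has shape (d, d), b has shape (d,)

def dequantize(codes, A, b):
    """Reconstruct weights; linear in A and b, so gradients flow to both."""
    return codes @ A + b

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 4))                  # 256 groups of 4 weights (toy data)

codes, scale, lo = scalar_quantize_2bit(w)
A, b = fit_affine_map(codes, w)
w_hat = dequantize(codes, A, b)

# Plain scalar dequantization is the special case A = scale * I, b = lo.
w_sq = codes * scale + lo
mse_scalar = float(np.mean((w - w_sq) ** 2))
mse_affine = float(np.mean((w - w_hat) ** 2))
print(f"scalar MSE: {mse_scalar:.4f}, affine MSE: {mse_affine:.4f}")
```

Because the reconstruction is a matrix product plus a bias, it involves no table lookup and is differentiable with respect to `A` and `b`; in this toy setup the fitted map can never do worse than plain scalar dequantization, since that is one of the affine maps the least-squares fit searches over.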