This paper analyzes the information-theoretic limits of quantizing dense linear layers and shows that GPTQ can be far from optimal. To address this, the authors propose WaterSIC, a quantization algorithm that allocates different quantization rates to different columns of the weight matrix using a waterfilling approach. WaterSIC stays within 0.255 bits of the information-theoretic limit and establishes new state-of-the-art results for quantizing Llama and Qwen LLMs at 1 to 4 bits.
GPTQ's quantization of LLMs leaves performance on the table: WaterSIC closes the gap with an information-theoretically near-optimal approach that beats the state of the art on Llama and Qwen.
This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information-theoretically (IT). It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and is shown to be within a rate gap of 0.255 bits to the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
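To make the "waterfilling" idea concrete, the sketch below implements the classical textbook reverse waterfilling rate allocation for independent Gaussian components under squared-error distortion. It is not the WaterSIC algorithm itself, whose details the abstract does not give. The function name `reverse_waterfill`, its parameters, and the example variances are hypothetical choices for illustration. Each component with variance above a common "water level" receives rate 0.5 * log2(variance / level), and components below the level receive zero rate.

```python
import math


def reverse_waterfill(variances, total_rate_bits, iters=200):
    """Classical reverse waterfilling (illustrative, not WaterSIC itself).

    Finds the water level at which the per-component rates
    r_i = max(0, 0.5 * log2(var_i / level)) sum to total_rate_bits.
    Returns (level, rates).
    """
    if total_rate_bits <= 0:
        raise ValueError("total_rate_bits must be positive")
    if not variances or any(v <= 0 for v in variances):
        raise ValueError("variances must be a non-empty list of positive numbers")

    def rates_at(level):
        # Components whose variance is below the water level get zero rate.
        return [max(0.0, 0.5 * math.log2(v / level)) for v in variances]

    # The total rate decreases monotonically as the level rises,
    # so the matching level can be found by bisection.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(rates_at(mid)) > total_rate_bits:
            lo = mid  # level too low: spending more rate than allowed
        else:
            hi = mid  # level high enough: try lowering it
    return hi, rates_at(hi)


if __name__ == "__main__":
    # Hypothetical per-component variances and a 2-bit total budget.
    level, rates = reverse_waterfill([4.0, 1.0, 0.25], total_rate_bits=2.0)
    print("water level:", level)
    print("rates (bits):", rates)
```

With the example variances 4, 1, and 0.25 and a 2-bit budget, the water level settles at about 0.5, so the two higher-variance components receive about 1.5 and 0.5 bits and the lowest-variance component receives none. This uneven split is the behavior that per-column rate allocation is meant to mimic.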