This paper addresses the inference slowdown observed when deploying non-uniformly quantized 3-bit Large Language Models (LLMs), caused by dequantization overhead and GPU underutilization. The authors introduce Quantix, a framework that uses hardware-aligned bit shuffling and a fused dequantization-multiplication pipeline to convert memory savings into inference speedups. Experiments on NVIDIA L40 GPUs show that Quantix achieves average kernel-level speedups of 4.82× over FP16 cuBLAS and end-to-end speedups of up to 11.46× over existing quantization methods.
Naive quantization can paradoxically *slow down* LLM inference, but Quantix flips the script with 11x speedups via hardware-aware data layout and kernel fusion.
While Large Language Models (LLMs) are widely adopted, their massive parameter counts constrain practical deployment. A common solution is clustering-based non-uniform quantization, which compresses models to as low as 3 bits per weight while preserving high accuracy. However, instead of accelerating memory-bound LLM inference, this memory reduction often paradoxically causes a significant slowdown due to dequantization overhead and GPU underutilization. To address this, we propose Quantix, a framework designed to convert memory savings into inference speedups. Quantix applies two key optimizations: (1) a hardware-aligned bit shuffling scheme for efficient data access, and (2) a fused dequantization-multiplication pipeline that effectively maps workloads onto both CUDA cores and Tensor Cores. Quantix enables high-throughput batched inference, delivering average kernel-level speedups of 4.82× over FP16 cuBLAS and end-to-end speedups of up to 11.46× over state-of-the-art quantization methods on NVIDIA L40 GPUs.
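To make the abstract's premise concrete, here is a minimal NumPy sketch of clustering-based non-uniform 3-bit quantization: a 1-D k-means codebook of 8 centroids, with each weight stored as a 3-bit index and dequantization reduced to a codebook gather. This illustrates the general technique the paper builds on, not Quantix's actual kernels; all function names are hypothetical.

```python
import numpy as np

def kmeans_1d(x, k=8, iters=20):
    """Build a k-entry codebook of centroids via simple 1-D k-means."""
    centroids = np.quantile(x, np.linspace(0, 1, k))  # spread initial centroids
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = x[idx == j].mean()
    return centroids

def quantize(weights, k=8):
    """Map each weight to the index of its nearest centroid (3 bits for k=8)."""
    flat = weights.ravel()
    codebook = kmeans_1d(flat, k)
    codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), codebook

def dequantize(codes, codebook, shape):
    """Dequantization is just a gather from the codebook."""
    return codebook[codes].reshape(shape)

np.random.seed(0)
w = np.random.randn(64, 64).astype(np.float32)
codes, book = quantize(w)
w_hat = dequantize(codes, book, w.shape)
```

The gather in `dequantize` is exactly the per-element lookup that, done naively on a GPU, adds the overhead the abstract describes: every matrix multiply must first materialize FP16 weights from 3-bit codes, which is what motivates fusing dequantization into the multiplication kernel.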