HIT · HKUST · Mar 18, 2026 · arXiv:2603.17435

ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiao-Xia Chu, Xiaowen Chu

AI Summary

ZipServ introduces a hardware-aware lossless compression framework for LLM serving, addressing memory and bandwidth bottlenecks by co-designing compression with GPU architecture. It uses Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) for parallel decoding and a fused decompression-GEMM (ZipGEMM) kernel to decompress weights directly into Tensor Core registers. Experiments demonstrate up to 30% model size reduction, 2.21x kernel speedup over cuBLAS, and 1.22x end-to-end inference speedup over vLLM.

Key Contribution

Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.

Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on the fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21× kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22× over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.
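The abstract's contrast between variable-length bitstreams and a bitmap-based format that decodes in constant time can be illustrated with a toy example. The sketch below is not TCA-TBE, whose layout the abstract does not specify; it is a generic single-bitmap scheme, and the names `BLOCK`, `encode_block`, `decode_element`, and `decode_block` are hypothetical. It shows only the general idea: with a bitmap, the storage location of element `i` follows from one mask and one popcount, so each element can be decoded without scanning the elements before it, which a Huffman-style stream would require.

```python
from collections import Counter

BLOCK = 64  # hypothetical block size, chosen so the bitmap fits one 64-bit word


def encode_block(values):
    """Encode one block as (base, bitmap, exceptions). Toy scheme, not TCA-TBE."""
    assert len(values) == BLOCK
    base = Counter(values).most_common(1)[0][0]  # most frequent value in the block
    bitmap = 0
    exceptions = []
    for i, v in enumerate(values):
        if v == base:
            bitmap |= 1 << i        # bit i set: element i equals the base value
        else:
            exceptions.append(v)    # stored densely, in position order
    return base, bitmap, exceptions


def decode_element(i, base, bitmap, exceptions):
    """Decode element i using only the bitmap and one lookup."""
    if (bitmap >> i) & 1:
        return base
    below = bitmap & ((1 << i) - 1)               # bits for positions < i
    exceptions_before = i - bin(below).count("1")  # i minus popcount
    return exceptions[exceptions_before]


def decode_block(base, bitmap, exceptions):
    # Every call is independent of the others, which is the property a
    # one-thread-per-element GPU decoder would rely on.
    return [decode_element(i, base, bitmap, exceptions) for i in range(BLOCK)]
```

On a GPU the popcount would typically map to a hardware instruction, so the per-element cost stays fixed regardless of how many exceptions the block contains.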

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 80

Year: 2026

Venue: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
