The paper introduces LEXI, a lossless exponent compression scheme based on Huffman coding, to reduce data movement overheads in LLM inference. BF16 exponent streams exhibit low Shannon entropy, so LEXI compresses them on the fly for activations and caches, and stores weights in compressed form for just-in-time decompression. Applied to Jamba, Zamba, and Qwen LLMs on a homogeneous chiplet architecture, LEXI achieves a 33-45% reduction in inter-chiplet communication latency and a 30-35% reduction in end-to-end inference latency.
By exploiting the low entropy of BF16 exponents with Huffman coding, LEXI slashes inter-chiplet communication latency in LLMs by up to 45% without sacrificing accuracy.
Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. BF16 allocates eight bits to the exponent, but our profiling reveals that exponent streams exhibit fewer than 3 bits of Shannon entropy, indicating high inherent compressibility. To exploit this potential, we propose LEXI, a novel lossless exponent compression scheme based on Huffman coding. LEXI compresses activations and caches on the fly while storing compressed weights for just-in-time decompression near compute, without sacrificing system throughput or model accuracy. The codecs at the ingress and egress ports of network-on-chip routers sustain the maximum link bandwidth via multi-lane LUT decoders, incurring only 0.09 percent area and energy overheads in GF 22 nm technology. LEXI reduces inter-chiplet communication and end-to-end inference latencies by 33-45 percent and 30-35 percent, respectively, on modern Jamba, Zamba, and Qwen LLMs implemented on a homogeneous chiplet architecture.
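The core idea in the abstract, measuring the entropy of BF16 exponent fields and coding them with a Huffman code, can be illustrated with a minimal software sketch. This is not the paper's hardware codec. It assumes synthetic Gaussian values in place of real weights or activations, obtains the BF16 exponent by truncating a float32 (the exponent field is the same 8 bits), and uses hypothetical helper names (`bf16_exponent`, `shannon_entropy`, `huffman_code`, `encode`, `decode`) introduced only for illustration.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """8-bit exponent field of x, assuming BF16 is a truncated float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def shannon_entropy(counts: Counter) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code(counts: Counter) -> dict:
    """Build a prefix code {symbol: bitstring} from symbol counts."""
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code) -> str:
    return "".join(code[s] for s in symbols)


def decode(bits: str, code) -> list:
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # prefix-free, so the first match is correct
            out.append(inverse[current])
            current = ""
    return out


# Synthetic stand-in for a tensor: zero-mean Gaussian values (an assumption).
rng = random.Random(0)
values = [rng.gauss(0.0, 0.02) for _ in range(20000)]
exponents = [bf16_exponent(v) for v in values]

counts = Counter(exponents)
entropy = shannon_entropy(counts)
code = huffman_code(counts)
avg_bits = sum(counts[s] * len(code[s]) for s in counts) / len(exponents)
bitstream = encode(exponents, code)

print(f"entropy: {entropy:.2f} bits, Huffman average: {avg_bits:.2f} bits (raw: 8)")
print("lossless round trip:", decode(bitstream, code) == exponents)
```

If the sign and 7-bit mantissa are stored uncompressed, which is an assumption of this sketch, an average exponent code length of b bits means each BF16 value occupies 8 + b bits instead of 16. The bit-serial decoder here is for clarity only; the abstract describes multi-lane LUT decoders, which this sketch does not model.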