LLMs can maintain near-perfect accuracy on long sequences while retaining only 25% of the KV cache, using a semantic clustering approach that makes CPU-GPU offloading substantially more efficient.
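The summary does not describe the method's details, but the general idea of semantic KV-cache compression can be sketched as follows: cluster the key vectors of past tokens, then keep a fixed budget of tokens (here 25%) spread across clusters, preferring tokens nearest their cluster centroid. This is a minimal illustration under assumed details (k-means clustering, centroid-nearest selection, the function name `select_kv_budget`), not the paper's actual algorithm.

```python
import numpy as np

def select_kv_budget(keys, budget_frac=0.25, n_clusters=8, n_iters=10, seed=0):
    """Illustrative semantic KV-cache selection (assumed details, not the
    paper's method): k-means over key vectors, then keep a token budget
    round-robin across clusters, centroid-nearest tokens first."""
    rng = np.random.default_rng(seed)
    n, _ = keys.shape
    budget = max(1, int(n * budget_frac))
    # Initialize centroids from randomly chosen token keys.
    centroids = keys[rng.choice(n, size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each token's key to its nearest centroid.
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centroids as the mean of their members.
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Rank tokens within each cluster by distance to their centroid.
    token_dist = dists[np.arange(n), assign]
    order = np.argsort(token_dist)
    per_cluster = [[i for i in order if assign[i] == c] for c in range(n_clusters)]
    # Fill the budget round-robin so every semantic cluster stays represented.
    kept, cursor = [], [0] * n_clusters
    while len(kept) < budget:
        for c in range(n_clusters):
            if cursor[c] < len(per_cluster[c]) and len(kept) < budget:
                kept.append(per_cluster[c][cursor[c]])
                cursor[c] += 1
    return np.sort(np.array(kept))
```

Only the selected indices' keys and values would stay on the GPU; the rest could be offloaded to CPU memory and fetched on demand.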