Attention Sink, the tendency of Transformers to fixate on seemingly irrelevant tokens, is more than a quirk: it is a fundamental challenge that affects training and inference and can even cause hallucinations, demanding a systematic approach to understanding and mitigating it.
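As a rough illustration (not taken from the paper), attention sink is often quantified by how much attention mass queries collapse onto the first few tokens. The sketch below computes that fraction from a toy causal attention map; the function name, sink window, and threshold are assumptions for demonstration only.

```python
import numpy as np

def attention_sink_fraction(attn: np.ndarray, sink_positions: int = 1) -> float:
    """Fraction of attention mass placed on the first `sink_positions` tokens,
    averaged over heads and queries.

    attn: attention weights of shape [num_heads, seq_len, seq_len],
          each query row assumed to sum to 1 after softmax.
    """
    sink_mass = attn[:, :, :sink_positions].sum(axis=-1)  # [heads, queries]
    return float(sink_mass.mean())

# Toy example: a 2-head causal attention map biased toward the first token.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 8, 8))
logits[:, :, 0] += 4.0                                   # push mass onto token 0
mask = np.triu(np.ones((8, 8), dtype=bool), k=1)         # causal mask: no future tokens
logits[:, mask] = -np.inf
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(f"sink fraction: {attention_sink_fraction(attn):.2f}")
```

A value close to 1.0 indicates that most attention collapses onto the leading token, the signature of a sink.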
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
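The paper's exact criterion is not given here; as a hypothetical sketch of the general idea, one can blend a per-layer numerical sensitivity score with a structural-importance weight and bucket the result into bit widths. All layer names, weights, and thresholds below are illustrative assumptions, not the paper's method.

```python
from typing import Dict

def assign_bit_widths(
    sensitivity: Dict[str, float],        # per-layer numerical sensitivity, normalized to [0, 1]
    structural_weight: Dict[str, float],  # illustrative weight for a layer's structural role
    alpha: float = 0.5,
) -> Dict[str, int]:
    """Toy mixed-precision assignment: combine the two scores and map to bit widths."""
    bits = {}
    for name, sens in sensitivity.items():
        score = alpha * sens + (1 - alpha) * structural_weight.get(name, 0.0)
        if score > 0.75:
            bits[name] = 8   # keep the most critical layers at higher precision
        elif score > 0.4:
            bits[name] = 4
        else:
            bits[name] = 2   # aggressively quantize low-impact layers
    return bits

# Hypothetical per-layer scores.
sens = {"embed": 0.9, "attn.0": 0.5, "mlp.0": 0.2, "lm_head": 0.8}
struct = {"embed": 1.0, "attn.0": 0.6, "mlp.0": 0.3, "lm_head": 1.0}
print(assign_bit_widths(sens, struct))
```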
By pruning and quantizing the KV cache, XStreamVGGT achieves a remarkable 4.42x memory reduction and 5.48x speedup in streaming 3D reconstruction without sacrificing performance.
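For intuition only (this is not the XStreamVGGT implementation), the sketch below shows the two generic operations the sentence refers to: dropping low-importance cache entries and storing the survivors in 8-bit. The importance scoring, shapes, and keep ratio are illustrative assumptions.

```python
import numpy as np

def compress_kv_cache(keys, values, importance, keep_ratio=0.5):
    """Prune the least-important cached tokens, then quantize the rest to int8.

    keys, values: [num_tokens, head_dim] float32 arrays.
    importance:   [num_tokens] score per cached token (e.g. accumulated attention).
    Returns int8 keys/values plus per-token scales for dequantization.
    """
    num_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-num_keep:]      # indices of the most important tokens
    keep.sort()                                    # preserve original token order
    k, v = keys[keep], values[keep]

    # Symmetric per-token int8 quantization: x_q = round(x / scale).
    k_scale = np.abs(k).max(axis=1, keepdims=True) / 127.0 + 1e-8
    v_scale = np.abs(v).max(axis=1, keepdims=True) / 127.0 + 1e-8
    k_q = np.clip(np.round(k / k_scale), -127, 127).astype(np.int8)
    v_q = np.clip(np.round(v / v_scale), -127, 127).astype(np.int8)
    return k_q, k_scale, v_q, v_scale, keep

# Toy usage: 16 cached tokens with random importance scores.
rng = np.random.default_rng(1)
K = rng.normal(size=(16, 64)).astype(np.float32)
V = rng.normal(size=(16, 64)).astype(np.float32)
scores = rng.random(16)
k_q, k_s, v_q, v_s, kept = compress_kv_cache(K, V, scores)
print(f"kept {len(kept)}/16 tokens; int8 cache uses ~{k_q.nbytes + v_q.nbytes} bytes")
```

Pruning shrinks the number of cached tokens while int8 storage shrinks the bytes per token; the two combine multiplicatively, which is how large memory and latency reductions become possible in streaming settings.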