Video-LLMs can be sped up by nearly 3x without sacrificing performance, simply by relaxing speculative decoding's exact-match acceptance criterion and accepting draft tokens based on visual-semantic relevance instead.
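The blurb doesn't spell out the acceptance rule, so here is a minimal sketch of the general idea: instead of rejecting a draft token whenever it differs from the target model's pick, accept it when the two tokens are semantically close in embedding space. The function name `relaxed_accept` and the `threshold` cutoff are hypothetical, not the paper's actual criterion.

```python
import torch
import torch.nn.functional as F

def relaxed_accept(draft_token, target_logits, embed, threshold=0.8):
    """Accept a draft token if it is semantically close to the target
    model's choice, rather than requiring an exact token match.

    draft_token: int, token proposed by the small draft model.
    target_logits: (vocab,) logits from the large target model.
    embed: (vocab, dim) token embedding table used to measure similarity.
    threshold: hypothetical cosine-similarity cutoff; a real system would
        tune this, and might only relax matching for visually grounded tokens.
    """
    target_token = int(target_logits.argmax())
    if draft_token == target_token:
        return True  # standard speculative-decoding acceptance
    # Relaxed path: accept near-synonyms the target would also tolerate.
    sim = F.cosine_similarity(embed[draft_token], embed[target_token], dim=0)
    return sim.item() >= threshold
```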
Visual token dominance is the hidden culprit behind LVLM inference inefficiency, and this paper dissects the problem to reveal how to navigate the fidelity-efficiency tradeoff.
MLLMs can achieve up to 7.9x KV cache compression and 1.52x faster decoding without sacrificing performance by applying distinct compression strategies to different attention heads.
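As a rough illustration of head-wise KV compression under stated assumptions: heads that attend locally keep a recency window, while heads that retrieve salient tokens keep the top-k by cumulative attention. The two-way `head_type` taxonomy and the `budget` parameter are illustrative, not the paper's actual head classification.

```python
import torch

def compress_head_kv(keys, values, attn_scores, head_type, budget=64):
    """Compress one attention head's KV cache to a fixed token budget,
    using a strategy matched to the head's observed behavior.

    keys/values: (seq_len, head_dim) cached K/V for this head.
    attn_scores: (seq_len,) cumulative attention each cached token received.
    head_type: 'local' heads mostly attend nearby, so keep a recent window;
        other heads attend to salient tokens, so keep the top-k by score.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values  # nothing to evict yet
    if head_type == "local":
        idx = torch.arange(seq_len - budget, seq_len)          # recency window
    else:
        idx = attn_scores.topk(budget).indices.sort().values   # salient tokens
    return keys[idx], values[idx]
```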
Streaming 3D reconstruction gets a free lunch: MeMix, a training-free module, cuts reconstruction error by up to 40% through selective memory-patch updates, fighting catastrophic forgetting without adding parameters.
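A minimal sketch of what "selective memory-patch updating" could look like: patches whose features still agree with memory are left untouched (preserving old geometry), and only patches where the scene has changed get overwritten. The cosine gate, `sim_threshold`, and all names here are an illustrative guess, not MeMix itself.

```python
import numpy as np

def update_memory(memory, new_feats, sim_threshold=0.9):
    """Training-free, selective memory update for streaming reconstruction.

    memory/new_feats: (num_patches, dim) patch features from the running
        memory and the current frame. Patches that still match memory are
        kept as-is, so earlier observations are not forgotten.
    """
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    n = new_feats / (np.linalg.norm(new_feats, axis=1, keepdims=True) + 1e-8)
    cos = (m * n).sum(axis=1)        # per-patch agreement with memory
    stale = cos < sim_threshold      # patches where the scene changed
    memory[stale] = new_feats[stale] # refresh only those patches
    return memory
```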