Zhejiang University, Westlake University
Video-LLMs can achieve up to 2.65x faster time-to-first-token and a 61% reduction in FLOPs by compressing visual tokens *inside* the vision encoder, not just after it.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.