MLLMs can now efficiently process 10K-frame videos without additional training by adaptively selecting tokens based on the model's own uncertainty about the content.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
Video Language Models can achieve up to 86% faster time-to-first-token and 93% token reduction by ditching full-image encoding in favor of motion vectors and residuals from video codecs.