MLLMs can now efficiently process 10K-frame videos without training, by adaptively selecting tokens based on the model's own uncertainty about the content.
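A minimal sketch of the general idea of uncertainty-driven token selection (assumed setup, not the paper's exact method): rank visual tokens by the model's predictive entropy and keep only the most uncertain ones under a fixed budget, so a very long video fits the context window without any extra training. The function and argument names here are illustrative.

```python
import torch


def select_uncertain_tokens(token_logits: torch.Tensor,
                            token_embeds: torch.Tensor,
                            budget: int) -> torch.Tensor:
    """token_logits: (num_tokens, vocab) model scores for each visual token.
    token_embeds: (num_tokens, dim) the corresponding token embeddings.
    Returns the `budget` embeddings the model is least certain about."""
    probs = torch.softmax(token_logits, dim=-1)
    # Predictive entropy per token: high entropy = high model uncertainty.
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)
    keep = torch.topk(entropy, k=min(budget, entropy.numel())).indices
    # Preserve the original temporal order of the selected tokens.
    keep, _ = torch.sort(keep)
    return token_embeds[keep]
```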
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
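A minimal sketch of what such a distillation objective could look like (an assumption based on the summary, not Loc3R-VLM's actual training code): regress the 2D VLM's patch features onto features from a frozen 3D foundation model computed on the same monocular frames, via a learned projection head. All names below are hypothetical.

```python
import torch
import torch.nn.functional as F


def distillation_loss(vlm_feats: torch.Tensor,
                      feats_3d: torch.Tensor,
                      proj: torch.nn.Linear) -> torch.Tensor:
    """vlm_feats: (batch, patches, d_vlm) features from the trainable 2D VLM.
    feats_3d:  (batch, patches, d_3d) features from the frozen 3D teacher.
    proj: learned linear head mapping the VLM space into the teacher space."""
    pred = proj(vlm_feats)
    # Cosine distance per patch; the mean over patches is the distillation loss.
    return 1.0 - F.cosine_similarity(pred, feats_3d, dim=-1).mean()
```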