Current MLLMs are surprisingly bad at understanding human intent in egocentric videos at a step-by-step level, achieving only 33% accuracy on a new benchmark designed to prevent future-frame leakage.
Pre-trained video diffusion models can be deterministically adapted into state-of-the-art zero-shot depth estimators, sidestepping the need for massive labeled datasets.
Forget short looping animations: this new diffusion model generates hour-long, real-time human animations with lip-sync accuracy and emotional expressiveness, all while running on just two GPUs.
The first comprehensive survey of Visual Document Retrieval reveals how MLLMs are reshaping the field, highlighting the shift towards RAG and agentic systems for complex document understanding.