Zero-shot synthesis of articulated human-object interactions is now possible by treating diffusion-generated videos as supervision for 4D scene reconstruction, unlocking physically grounded interactions beyond rigid manipulation.
Unified multimodal models often *hurt* performance on multimodal understanding tasks, with spatial reasoning, visual illusions, and multi-round reasoning as the exceptions, challenging the assumption that generation universally improves understanding.
Achieve 10% higher success rates on robotic manipulation tasks and 1.5-1.8x faster inference by intelligently pruning visual tokens in multi-view Vision-Language-Action models.
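A minimal sketch of what score-based visual token pruning could look like in a multi-view VLA pipeline; the `prune_visual_tokens` helper, the importance scores, and the keep ratio below are illustrative assumptions, not the paper's actual method:

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the highest-scoring visual tokens from each camera view.

    tokens: (views, num_tokens, dim) visual token embeddings
    scores: (views, num_tokens) per-token importance, e.g. attention mass
    keep_ratio: fraction of tokens retained per view (assumed hyperparameter)
    """
    views, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k most important tokens in each view.
    top_idx = scores.topk(k, dim=1).indices                    # (views, k)
    # Gather the surviving tokens so the action decoder sees a shorter sequence.
    pruned = torch.gather(
        tokens, 1, top_idx.unsqueeze(-1).expand(-1, -1, dim)   # (views, k, dim)
    )
    return pruned

# Example: 3 camera views, 256 tokens each, keep half before the action decoder.
tokens = torch.randn(3, 256, 768)
scores = torch.rand(3, 256)
print(prune_visual_tokens(tokens, scores, keep_ratio=0.5).shape)  # torch.Size([3, 128, 768])
```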
A 1000x larger video reasoning dataset reveals early signs of emergent generalization, offering a new foundation for training and evaluating spatiotemporal AI.
Achieve SOTA joint audio-video generation with JavisDiT++ using just 1M public training examples, rivaling the performance of models trained on proprietary datasets.
Turns out, skipping the boring parts of a video (like static backgrounds) makes your vision AI both faster and smarter, beating state-of-the-art models with less data.
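As a rough illustration of the idea, here is a toy sketch that drops video patch tokens whose embeddings barely change between consecutive frames; the `drop_static_tokens` name, the threshold, and the tensor shapes are assumptions for illustration, not the paper's implementation:

```python
import torch

def drop_static_tokens(frames: torch.Tensor, threshold: float = 0.02):
    """Mask out patch tokens whose content barely changes over time.

    frames: (T, num_patches, dim) patch embeddings for T video frames
    threshold: minimum mean absolute change for a patch to be kept (assumed value)
    Returns a boolean keep-mask of shape (T, num_patches).
    """
    # Per-patch change magnitude between consecutive frames.
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=-1)   # (T-1, num_patches)
    keep = diffs > threshold
    # Always keep every patch of the first frame as a reference.
    first = torch.ones(1, frames.shape[1], dtype=torch.bool)
    return torch.cat([first, keep], dim=0)

# Example: 16 frames of 196 patches; the mask tells the model which tokens to process.
frames = torch.randn(16, 196, 512)
mask = drop_static_tokens(frames)
print(mask.shape, mask.float().mean().item())  # (16, 196) and the kept fraction
```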