LMMs struggle to ground text queries in the right parts of images, but explicitly modeling salient visual subjects can dramatically improve cross-modal retrieval.
MLLMs can gain surprisingly strong 3D spatial reasoning abilities simply by tapping into the latent knowledge already present in video generation models.
A novel dual-branch attention mechanism that mimics the efficiency of keypoint matching achieves a 12.4x speedup in 3D reconstruction.