Current multimodal large language models (MLLMs) struggle with fine-grained spatial reasoning, achieving only 37.2 F1 on challenging tasks, compared to human performance of 84.0 F1.
Foundation models struggle with spatial tasks, achieving only 12% success in reproducing target viewpoints, but a novel post-training framework boosts performance to over 51%.
Generative training not only enhances a model's ability to manipulate objects in images, but also surprisingly strengthens its spatial reasoning skills.
OmniJigsaw reveals a "bi-modal shortcut phenomenon" in joint audio-visual integration: naive fusion can be surprisingly ineffective, which highlights the importance of carefully designed cross-modal training strategies.