Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
VLMs can't count blocks because they lack a view-consistent spatial interface, but decomposing scenes into orthographic projections fixes the problem.