New York University
MLLMs are failing to visually track events in videos, performing only modestly above baseline despite strong results on other benchmarks.
Camera pose, largely ignored in video LLMs, unlocks significant gains in spatial reasoning and even improves general video QA when used as a lightweight supervisory signal.