Multimodal LLMs (MLLMs) fail to visually track events in videos, performing only modestly above baseline despite strong results on other benchmarks.
Existing image editing models struggle with precision, achieving only 17.1% accuracy on a new benchmark designed to evaluate fundamental visual editing tasks.
Camera pose, largely ignored in video LLMs, unlocks significant gains in spatial reasoning and even improves general video QA when used as a lightweight supervisory signal.
Image generators aren't just for making pretty pictures; they're secretly state-of-the-art vision learners, rivaling specialized models in tasks from segmentation to depth estimation.