Search papers, labs, and topics across Lattice.
3
0
4
3
MLLMs are failing to visually track events in videos, performing only modestly above baseline despite strong results on other benchmarks.
Existing image editing models struggle with precision, achieving only 17.1% accuracy on a new benchmark designed to evaluate fundamental visual editing tasks.
Vision models are far more data-hungry than language models, but Mixture-of-Experts can harmonize this asymmetry for truly unified multimodal models.