Training a multimodal agent from scratch beats retrofitting existing LMMs with search tools, especially when you compress long interaction histories into visual summaries.
Current Composed Image Retrieval benchmarks are misleading: a new evaluation reveals that models struggle with query ambiguity and interactive scenarios.
MLLMs can achieve near-identical performance on long-form visual tasks using just 2.5% of the original visual tokens by mimicking human visual attention.