Search papers, labs, and topics across Lattice.
5
0
9
Training a multimodal agent from scratch beats retrofitting existing LMMs with search tools, especially when you compress long interaction histories into visual summaries.
Current Composed Image Retrieval benchmarks are misleading, as a new evaluation reveals that models struggle with query ambiguity and interactive scenarios.
ALMs can now pinpoint sounds in time with far greater accuracy, thanks to a new training method that stops them from hallucinating timestamps.
MLLMs can achieve near-identical performance on long-form visual tasks with just 2.5% of the original visual tokens by mimicking human visual attention.
Soccer tactics, previously viewed as too stochastic for accurate modeling, can now be realistically simulated with a diffusion model that captures nuanced team styles and predicts future outcomes.