MLLMs can advance video understanding by integrating watching, remembering, and reasoning into a single framework that handles long-range dependencies and sparse evidence.
With a 43.65% Effective Temporal F1 score, this work shows that MLLMs can be adapted effectively to complex One-to-Many Temporal Grounding tasks, moving past limitations of previous models.
Despite impressive headline scores, today's best video MLLMs can't reliably ground their answers in space and time, achieving <1% accuracy when required to identify the spatio-temporal evidence supporting their predictions.