Video understanding is rapidly shifting from isolated pipelines to unified models that can adapt to diverse downstream tasks, a shift that calls for re-evaluating current approaches.
MLLMs can "hear" a little, but EgoSound reveals they're still largely deaf to the nuances of sound in egocentric video, especially when it comes to spatial and causal reasoning.