Renmin University of China
Current video understanding models struggle with long-horizon robustness and non-speech audio, as revealed by the new OmniPro benchmark designed for comprehensive omni-modal proactive evaluation.
By explicitly modeling speech, SAVE leapfrogs existing audio-visual methods for video-text retrieval, achieving substantial gains over the state of the art.