2 papers published across 1 lab.
Current video understanding models struggle with long-horizon robustness and non-speech audio, according to OmniPro, a new benchmark for comprehensive omni-modal proactive evaluation.
Multimodal LLMs struggle to pinpoint objects from nouns alone, but SWIM training realigns vision and language to outperform visual-prompt methods.