Search papers, labs, and topics across Lattice.
1
0
3
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.