Multimodal models can now handle audio natively with improved efficiency, achieving state-of-the-art results in complex tasks like document understanding and agentic computer use.
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.
Current multimodal LLMs struggle with long-form video understanding, either forgetting details or losing track of the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.