Search papers, labs, and topics across Lattice.
3
0
6
11
Multimodal models can now achieve state-of-the-art performance in real-world tasks like document understanding and audio-video comprehension with significantly reduced inference latency thanks to novel token-reduction techniques.
Current multimodal models are surprisingly bad at understanding long, complex videos, struggling to integrate audio, visual, and text cues even for basic reasoning tasks.
Current multimodal LLMs choke on long-form video understanding, either forgetting details or getting lost in the timeline, but a new agentic architecture with dynamic memory management offers a promising fix.