Video-to-audio (V2A) models prioritize text captions over visual cues when generating audio, producing sounds that are physically plausible but often temporally misaligned.
Video LLMs can ace individual traffic video questions but still fail spectacularly at subtle counterfactual reasoning, revealing a critical blind spot for safety-critical applications.
Unified benchmarks reveal how well state-of-the-art methods handle multiple simultaneous real-world image degradations such as blur, low light, and rain.
Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.