Video-to-audio (V2A) models prioritize text captions over visual cues when generating audio, producing sounds that are physically plausible but often temporally misaligned.
Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.