Audio-language models can now reason about 30-minute-long audio clips with timestamp-grounded intermediate steps, unlocking a new level of fine-grained understanding.
Audio-visual LLMs (AVLLMs) may "hear" at intermediate layers, but when generating text they largely ignore audio cues in favor of vision, revealing a fundamental modality bias.