MLLMs aren't blind and deaf because they can't see or hear, but because their text-trained decoders ignore most of what the encoders pass along.
Audio-language models weight text roughly 10x more than audio when the two modalities conflict, even when the audio is the more accurate signal, revealing a fundamental asymmetry in how these models reason across input types.