Search papers, labs, and topics across Lattice.
ETH Zurich
2
0
5
12
VLMs don't lack visual understanding of quantity, they just can't connect what they see to symbolic number representations, revealing a fractured magnitude space.
Multimodal LLMs often perform worse with more modalities because they struggle to jointly recognize and reason across modalities, a problem solvable with simple prompting strategies.