Ditching the vision encoder actually *improves* multimodal understanding at scale, showing that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.
Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.