Search papers, labs, and topics across Lattice.
2
0
3
9
Ditching the vision encoder actually *improves* multimodal understanding at scale, proving that pixel embeddings alone can achieve state-of-the-art results in unified multimodal models.
Vision models are far more data-hungry than language models, but Mixture-of-Experts can harmonize this asymmetry for truly unified multimodal models.