This paper investigates multimodal pretraining by training models from scratch with the Transfusion framework, using next-token prediction for language and diffusion for vision. It identifies Representation Autoencoders (RAE) as optimal for unified visual representation and shows that visual and language data are complementary. Key findings include the emergence of world modeling capabilities, the effectiveness of Mixture-of-Experts (MoE) for efficient multimodal scaling, and a data scaling asymmetry in which vision is more data-hungry than language.
Vision models are far more data-hungry than language models, but Mixture-of-Experts can harmonize this asymmetry for truly unified multimodal models.
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
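The abstract does not spell out how the IsoFLOP analysis is carried out, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes the widely used approximation C ≈ 6·N·D for training compute (N parameters, D tokens) and a hypothetical loss surface of the form N^(-alpha) + D^(-beta). The function names, the constants, and the two (alpha, beta) pairs are all invented for illustration and are not values from the paper. For each compute budget it sweeps model sizes, keeps the loss-minimizing size, and then fits power laws N_opt ∝ C^a and D_opt ∝ C^b in log-log space. A larger data exponent b is what "more data-hungry" means in this setting.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    a = sxy / sxx
    return a, math.exp(my - a * mx)

def loss(n_params, n_tokens, alpha, beta):
    """Hypothetical loss surface (illustrative only, not from the paper)."""
    return n_params ** -alpha + n_tokens ** -beta

def isoflop_optimum(budget, alpha, beta):
    """Sweep model sizes at a fixed FLOP budget; return the best (N, D).

    Assumes training compute C ~= 6 * N * D, so D = C / (6 * N).
    """
    sizes = [10 ** (k / 50) for k in range(300, 601)]  # 1e6 .. 1e12 params
    best_n = min(sizes, key=lambda n: loss(n, budget / (6 * n), alpha, beta))
    return best_n, budget / (6 * best_n)

def scaling_exponents(alpha, beta, budgets=(1e18, 1e19, 1e20, 1e21)):
    """Fit N_opt ~ C**a and D_opt ~ C**b across budgets; returns (a, b)."""
    optima = [isoflop_optimum(c, alpha, beta) for c in budgets]
    n_exp, _ = fit_power_law(budgets, [n for n, _ in optima])
    d_exp, _ = fit_power_law(budgets, [d for _, d in optima])
    return n_exp, d_exp

# Made-up surfaces for two modalities; NOT the paper's fitted values.
language_like = scaling_exponents(alpha=0.3, beta=0.3)
vision_like = scaling_exponents(alpha=0.4, beta=0.2)
```

For this toy surface the analytic optimum gives a = beta / (alpha + beta) and b = alpha / (alpha + beta), so the grid-search estimates can be checked against those values. In a real study each point on an IsoFLOP curve would come from a trained model rather than a closed-form loss.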