The paper introduces Omni, a unified multimodal model trained natively on text, images, videos, 3D geometry, and hidden representations. This training gives rise to "Context Unrolling": the model explicitly reasons across multiple modal representations before producing a prediction, aggregating complementary information and improving downstream reasoning fidelity. Omni performs strongly on multimodal generation and understanding benchmarks and demonstrates in-context generation of text, images, video, and 3D geometry.
Training a single model across text, images, video, 3D geometry, and hidden representations enables "Context Unrolling," in which the model reasons across multiple modal representations before making a prediction, improving reasoning fidelity.
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.