LongCat-Next shatters the language-centric paradigm by unifying text, vision, and audio into a single autoregressive model with minimal modality-specific design, finally reconciling understanding and generation in discrete vision modeling.
Today's best multimodal models solve only about half of compositional visual tool-use tasks, revealing a critical gap in their ability to plan and execute complex, multi-step visual reasoning.