Surprisingly, general-purpose vision models already contain better action representations for robotic control than specialized embodied models trained explicitly for that purpose.
LongCat-Next moves beyond the language-centric paradigm by unifying text, vision, and audio in a single autoregressive model with minimal modality-specific design, reconciling understanding and generation in discrete vision modeling.
Instruction-based image editing models still struggle to edit small objects: despite strong scores on existing benchmarks, a new benchmark reveals significant performance gaps on this task.