Qwen-RobotWorld is a language-conditioned video world model for embodied tasks, including robotic manipulation, autonomous driving, and indoor navigation. Using a 60-layer double-stream diffusion transformer and an 8.6M video-text corpus, it predicts physically grounded future visual trajectories from current observations, which supports language-guided planning, policy evaluation, and synthetic data generation. The model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench, and zero-shot analyses on RoboTwin-IF support robust generalization.
Qwen-RobotWorld uses natural language as a unified action interface to predict future visual trajectories, ranking 1st overall on EWMBench and DreamGen Bench and outperforming all open-source models on WorldModelBench and PBench.
We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
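To make the "layer-wise joint attention" in part a) concrete, the following is a minimal NumPy sketch of one joint-attention step in a generic double-stream block. It is an illustration under stated assumptions, not the paper's implementation: the names `StreamParams` and `joint_attention`, the toy dimensions, and the random weights are all hypothetical, and the sketch omits normalization, MLPs, timestep conditioning, and positional encodings. The idea shown is only that each stream keeps its own projection weights while attention is computed over the concatenated text and video tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StreamParams:
    """Per-stream projection weights (hypothetical layout, random init)."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.wo = rng.normal(0.0, scale, (dim, dim))


def joint_attention(text_tokens, video_tokens, text_p, video_p, num_heads):
    """One joint-attention step over two token streams.

    text_tokens:  (n_text, dim)  stand-in for language-encoder features
    video_tokens: (n_video, dim) stand-in for video-VAE latent tokens
    Each stream is projected with its own weights; attention then runs
    over the concatenated sequence so the streams can exchange information.
    """
    n_text, dim = text_tokens.shape
    head_dim = dim // num_heads

    # Stream-specific query/key/value projections.
    qt, kt, vt = (text_tokens @ w for w in (text_p.wq, text_p.wk, text_p.wv))
    qv, kv, vv = (video_tokens @ w for w in (video_p.wq, video_p.wk, video_p.wv))

    # Concatenate along the sequence axis: text tokens first, then video.
    q = np.concatenate([qt, qv], axis=0)
    k = np.concatenate([kt, kv], axis=0)
    v = np.concatenate([vt, vv], axis=0)

    def split_heads(x):
        # (n, dim) -> (heads, n, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim)
    attn = softmax(scores, axis=-1)

    # Merge heads back: (heads, n, head_dim) -> (n, dim)
    out = (attn @ vh).transpose(1, 0, 2).reshape(-1, dim)

    # Split back into streams; stream-specific output projection + residual.
    text_out = text_tokens + out[:n_text] @ text_p.wo
    video_out = video_tokens + out[n_text:] @ video_p.wo
    return text_out, video_out


# Toy usage with made-up sizes (not the model's real dimensions).
rng = np.random.default_rng(0)
dim, heads = 16, 4
text = rng.normal(size=(5, dim))
video = rng.normal(size=(12, dim))
text_p, video_p = StreamParams(dim, rng), StreamParams(dim, rng)
text_out, video_out = joint_attention(text, video, text_p, video_p, heads)
print(text_out.shape, video_out.shape)
```

In a full model, a block like this would presumably be repeated at every layer, with the language features supplied by the frozen encoder; how the paper handles those details is not specified in the abstract.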