World-R1 uses reinforcement learning to improve the 3D consistency of text-to-video generation without modifying the underlying architecture. The authors build a pure-text dataset for world simulation and apply Flow-GRPO to optimize the model using feedback from pre-trained 3D foundation models and vision-language models. Results show improved 3D consistency with preserved visual quality, advancing video generation toward scalable world simulation.
Text-to-video models can now learn geometrically consistent world dynamics via reinforcement learning, without expensive architectural changes.
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure-text dataset tailored for world simulation. Using Flow-GRPO, we optimize the model with feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations show that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
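The abstract does not specify how the reward feedback is turned into a policy-gradient signal, but GRPO-style methods (of which Flow-GRPO is a variant) typically normalize each sample's reward against a group of rollouts from the same prompt. The sketch below illustrates that core step under assumptions not stated in the paper: the function names, the weighted blend of a 3D-consistency score and a VLM score, and the weights themselves are all hypothetical.

```python
import math

def combined_reward(consistency_score, vlm_score, w_3d=0.7, w_vlm=0.3):
    """Hypothetical weighted blend of a 3D-consistency score (from a
    pre-trained 3D foundation model) and a semantic score (from a VLM).
    The weights are illustrative, not taken from the paper."""
    return w_3d * consistency_score + w_vlm * vlm_score

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward against the group
    of videos sampled for the same prompt. Samples scoring above the
    group mean get positive advantage and are reinforced; those below
    are suppressed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative to the group rather than an absolute reward scale, this kind of normalization needs no learned value function, which is part of what makes the approach cheap enough to apply on top of a frozen architecture.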