Mar 3, 2026 arXiv:2603.03485

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

AI Summary

Phys4D is a three-stage pipeline that improves the physical consistency of 4D world representations learned from video diffusion models. It combines pseudo-supervised pretraining for geometry and motion, physics-grounded supervised fine-tuning on simulation-generated data, and simulation-grounded reinforcement learning that corrects residual physical violations. The authors also introduce a 4D world consistency evaluation suite that probes geometric coherence, motion stability, and long-horizon physical plausibility, and they report that Phys4D improves on appearance-driven baselines.

Key Contribution

A three-stage training pipeline (pseudo-supervised pretraining, physics-grounded supervised fine-tuning, and simulation-grounded reinforcement learning) that lifts appearance-driven video diffusion models into physics-consistent 4D world representations.

Abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

Computer Vision, Multimodal Models, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
