DiT4DiT, a novel Video-Action Model, couples a video Diffusion Transformer with an action Diffusion Transformer in a cascaded framework to improve robot learning. It extracts intermediate denoising features from video generation as temporally grounded conditions for action prediction, avoiding reliance on reconstructed future frames. A dual flow-matching objective with decoupled timesteps and noise scales enables coherent joint training, achieving SOTA results on LIBERO and RoboCasa GR1 with significantly less data, and demonstrating strong zero-shot generalization on a Unitree G1 robot.
By jointly modeling video dynamics and actions, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x in robot policy learning, showing that video generation can serve as a powerful scaling proxy.
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. Yet their potential remains underexplored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
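The dual flow-matching objective described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's released implementation: it assumes a standard rectified-flow interpolation x_t = (1 - t) x_0 + t * eps with velocity target eps - x_0, uses tiny stand-in MLPs in place of the video and action Diffusion Transformers, and the module names, feature dimensions, and conditioning scheme are all illustrative. The key structural points it shows are (a) the video and action branches draw independent (decoupled) timesteps, and (b) the action model is conditioned on an intermediate hidden state from the video model rather than on fully denoised future frames.

```python
# Hedged sketch of a dual flow-matching objective with decoupled
# timesteps (toy stand-ins for the video/action DiTs; names and shapes
# are illustrative assumptions, not the paper's architecture).
import torch
import torch.nn as nn


class TinyDiT(nn.Module):
    """Stand-in for a Diffusion Transformer: returns a velocity
    prediction plus an intermediate hidden state."""

    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.inp = nn.Linear(dim + 1 + cond_dim, 64)  # +1 for the timestep
        self.hid = nn.Linear(64, 64)
        self.out = nn.Linear(64, dim)

    def forward(self, x_t, t, cond=None):
        feats = [x_t, t.expand(x_t.shape[0], 1)]
        if cond is not None:
            feats.append(cond)
        h = torch.relu(self.hid(torch.relu(self.inp(torch.cat(feats, -1)))))
        return self.out(h), h  # (predicted velocity, hidden state)


def dual_flow_matching_loss(video_dit, action_dit, video0, action0):
    # Decoupled timesteps: each branch samples its own t independently.
    t_v, t_a = torch.rand(1), torch.rand(1)
    eps_v, eps_a = torch.randn_like(video0), torch.randn_like(action0)
    video_t = (1 - t_v) * video0 + t_v * eps_v
    action_t = (1 - t_a) * action0 + t_a * eps_a

    # Video branch: predict velocity and expose intermediate features.
    v_pred, hidden = video_dit(video_t, t_v)
    loss_video = ((v_pred - (eps_v - video0)) ** 2).mean()

    # Action branch conditions on the video hidden state, so no
    # reconstructed future frames are needed for action prediction.
    a_pred, _ = action_dit(action_t, t_a, cond=hidden)
    loss_action = ((a_pred - (eps_a - action0)) ** 2).mean()
    return loss_video + loss_action
```

In use, one would instantiate the two branches and backpropagate through the summed loss, which trains both modules jointly, e.g. `dual_flow_matching_loss(TinyDiT(32), TinyDiT(7, cond_dim=64), torch.randn(4, 32), torch.randn(4, 7))` for a batch of 4 with hypothetical 32-dim video latents and 7-dim actions.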