Mar 27, 2026 · arXiv:2604.16391

Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining

Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin, Li Zhang

AI Summary

The paper introduces DeFI, a framework for vision-language-action models that decouples visual forward dynamics pretraining from inverse dynamics pretraining so that each can draw on its own data sources. The General Forward Dynamics Model (GFDM) is pretrained on human and robot videos for future prediction, while the General Inverse Dynamics Model (GIDM) learns, via self-supervision, to infer latent actions from unlabeled video transitions. DeFI reports state-of-the-art results on the CALVIN ABC-D and SimplerEnv benchmarks (an average task length of 4.51 on CALVIN and a 51.2% success rate on SimplerEnv-Fractal), along with an 81.3% success rate in real-world deployment.

Key Contribution

Robots learn better when they first imagine the future and then figure out how to act, unlocking SOTA performance by disentangling forward and inverse dynamics pretraining.

Abstract

Vision-language-action (VLA) models have shown great potential for building generalist robots, but they still face a dilemma: the misalignment between 2D image forecasting and 3D action prediction. Moreover, training vision and action in such an entangled manner limits what a model can learn from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining so that each can exploit its own data sources, with video generation and action prediction disentangled. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this manner, GFDM and GIDM first shine separately and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance, with DeFI achieving an average task length of 4.51 on CALVIN, a 51.2% success rate on the SimplerEnv-Fractal benchmark, and an 81.3% success rate in real-world deployment, significantly outperforming prior methods.
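The division of labor described in the abstract, where a forward model predicts the future and an inverse model infers the action that explains a transition, can be illustrated with a toy sketch. This is not DeFI's implementation: GFDM and GIDM are learned networks operating on video, whereas the functions below are hand-written stand-ins on a 2D state vector. The names `forward_dynamics`, `inverse_dynamics`, and `act`, the `step` parameter, and the state space itself are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def forward_dynamics(state: Vector, goal: Vector, step: float = 0.5) -> Vector:
    """Toy stand-in for a forward model (hypothetical, not GFDM).

    Predicts the next state by moving a fixed fraction of the way to the goal.
    """
    return [s + step * (g - s) for s, g in zip(state, goal)]


def inverse_dynamics(state: Vector, next_state: Vector) -> Vector:
    """Toy stand-in for an inverse model (hypothetical, not GIDM).

    Recovers the displacement ("action") that explains the transition.
    It needs only a pair of states, never an action label.
    """
    return [n - s for s, n in zip(state, next_state)]


def act(state: Vector, goal: Vector) -> Vector:
    """Compose the two stages: imagine the next state, then infer the action."""
    predicted_next = forward_dynamics(state, goal)
    return inverse_dynamics(state, predicted_next)


state = [0.0, 2.0]
goal = [4.0, 0.0]
action = act(state, goal)
print(action)
```

Because `inverse_dynamics` consumes only state pairs, it could in principle be fit on transitions without action labels, which mirrors the motivation the abstract gives for training the two components on separate data sources.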

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 1

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
