The paper introduces LDA-1B, a robot foundation model that scales to 1B parameters by learning dynamics, policy, and visual forecasting from a new 30k-hour embodied interaction dataset (EI-30k) comprising diverse human and robot trajectories. LDA-1B leverages a structured DINO latent space for dynamics prediction to avoid pixel-space modeling and employs a multi-modal diffusion transformer to handle asynchronous vision and action streams. Experimental results demonstrate that LDA-1B outperforms existing methods on contact-rich, dexterous, and long-horizon tasks, while also enabling data-efficient fine-tuning by effectively utilizing low-quality trajectories.
Training a robot foundation model on 30,000 hours of heterogeneous embodied data lets it outperform prior methods by up to 48% on complex manipulation tasks and even benefit from low-quality data.
Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation-model level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show LDA-1B outperforms prior methods (e.g., $\pi_{0.5}$) by up to 21\%, 48\%, and 23\% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10\% by leveraging the 30\% of low-quality trajectories that are typically harmful and discarded.
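To illustrate the idea of latent-space dynamics learning, here is a minimal NumPy sketch. It is not the paper's implementation: the frozen random projection stands in for a pretrained DINO encoder, the linear dynamics model stands in for the diffusion transformer, and all names (`encode`, `W_dyn`, the dimensions) are hypothetical. The point it demonstrates is that the prediction loss is computed entirely in the frozen latent space, so no pixel-space appearance modeling is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen encoder standing in for DINO: a fixed random
# projection from flattened pixels to a compact latent space.
D_PIX, D_LAT, D_ACT = 64, 16, 4
W_enc = rng.normal(size=(D_PIX, D_LAT)) / np.sqrt(D_PIX)

def encode(frames):
    """Map pixel observations to frozen latents (stand-in for DINO)."""
    return frames @ W_enc

# Learnable linear dynamics model: z_{t+1} ~ [z_t, a_t] @ W_dyn.
W_dyn = np.zeros((D_LAT + D_ACT, D_LAT))

def dynamics_loss(frames_t, actions_t, frames_t1):
    """MSE between predicted and actual next-step latents.

    Both targets and predictions live in the frozen latent space;
    pixels are never reconstructed.
    """
    z_t, z_t1 = encode(frames_t), encode(frames_t1)
    pred = np.concatenate([z_t, actions_t], axis=1) @ W_dyn
    return np.mean((pred - z_t1) ** 2)

# Toy batch of (frame, action, next frame) transitions.
frames_t = rng.normal(size=(32, D_PIX))
actions_t = rng.normal(size=(32, D_ACT))
frames_t1 = frames_t + 0.1 * rng.normal(size=(32, D_PIX))

loss_before = dynamics_loss(frames_t, actions_t, frames_t1)

# Fit the linear dynamics model in closed form (least squares); a real
# model would instead take gradient steps on the same latent-space loss.
X = np.concatenate([encode(frames_t), actions_t], axis=1)
W_dyn, *_ = np.linalg.lstsq(X, encode(frames_t1), rcond=None)

loss_after = dynamics_loss(frames_t, actions_t, frames_t1)
assert loss_after < loss_before
```

The same structure carries over when the encoder is a real pretrained DINO network and the dynamics model is a large transformer: only the predictor's parameters are trained, while the latent targets come from the frozen encoder.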