A learning rate of $1\times 10^{-4}$, weight decay of 0.01, and a cosine annealing schedule are used for all training settings. Training is conducted in mixed precision with gradient clipping at 1.0. All latency numbers are measured on a single NVIDIA RTX GPU.
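For concreteness, the following is a minimal PyTorch sketch of this optimization setup (cosine-annealed learning rate of $1\times 10^{-4}$, weight decay 0.01, mixed precision, and gradient clipping at 1.0). The optimizer choice (AdamW), model, batch, and step budget are placeholder assumptions not stated in the excerpt.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)    # placeholder model
total_steps = 10_000                     # placeholder step budget for the cosine schedule

# Assumed AdamW optimizer with lr 1e-4 and weight decay 0.01, cosine-annealed over training.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # mixed-precision loss scaling
loss_fn = nn.CrossEntropyLoss()

for step in range(total_steps):
    x = torch.randn(32, 128, device=device)              # placeholder batch
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # mixed-precision forward pass
        loss = loss_fn(model(x), y)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                            # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                                      # cosine annealing of the learning rate
```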