KRAFTON | UW-Madison | May 5, 2026 | arXiv:2605.03269

RLDX-1 Technical Report

Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byung-Jun Yoon, C. Jang, Daewon Choi, Dongsu Han, Donguk Lee, H. Kwon, Hojin Jeon, Jaehyun Kang, Jae-sung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeon Park, Junyoung Sung, Kyungmin Lee, Minseong Han, MinSung Yoon, S. Joo, Seonil Son, Seungcheol Park, Seung-Mo Cho, Seungjun Moon, Seungku Kim, Yong Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeong Kim, Hazel Lee, Heecheol Kim, H. Ahn, H. Ryu, H. Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Jo-Ping Chang, J. Park, Jungwoo Park, J. Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwang-Hoe Kim, K. Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seung-tae Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sung-Po Yang, W. Jung, Yash Shukla, Yeonjae Lee, Y. Bae, Jinwoo Shin

AI Summary

The authors introduce RLDX-1, a vision-language-action (VLA) model for dexterous manipulation that integrates motion awareness, memory, and physical sensing. RLDX-1 uses a Multi-Stream Action Transformer (MSAT) architecture to unify heterogeneous modalities through modality-specific streams with cross-modal joint self-attention, and combines it with synthesized training data for rare manipulation scenarios and inference optimizations for real-time deployment. Empirical results show that RLDX-1 outperforms recent frontier VLAs such as $\pi_{0.5}$ and GR00T N1.6 on both simulation benchmarks and real-world tasks, most notably on ALLEX humanoid tasks, where it reaches an 86.8% success rate while both baselines achieve around 40%.

Key Contribution

On ALLEX humanoid tasks, RLDX-1 achieves roughly double the success rate of $\pi_{0.5}$ and GR00T N1.6 (86.8% versus around 40%), which the authors present as a promising step toward reliable VLAs for contact-rich, dynamic manipulation.

Abstract

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., $\pi_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving a success rate of 86.8%, while $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
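
The phrase "modality-specific streams with cross-modal joint self-attention" can be illustrated with a minimal, single-head sketch in plain Python. This is not the authors' MSAT implementation, whose details the abstract does not give; it only shows the generic pattern of giving each modality its own projection weights while attending over the concatenation of all tokens. The stream names (`vision`, `touch`, `action`), the dimensions, the random weights, and every function name below are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token embeddings."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def joint_self_attention(streams, weights):
    """Single-head joint attention over several modality streams (illustrative only).

    streams: dict mapping a stream name to its tokens (n_i x d).
    weights: dict mapping a stream name to its own 'q', 'k', 'v', 'o' matrices (d x d).
    """
    q, k, v, spans = [], [], [], {}
    # 1. Modality-specific projections: each stream uses its own weights.
    for name, tokens in streams.items():
        start = len(q)
        q += matmul(tokens, weights[name]["q"])
        k += matmul(tokens, weights[name]["k"])
        v += matmul(tokens, weights[name]["v"])
        spans[name] = (start, len(q))
    # 2. Joint self-attention: every token attends to tokens from all streams.
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    attn = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # 3. Split back per stream and apply that stream's own output projection.
    out = {name: matmul(mixed[a:b], weights[name]["o"]) for name, (a, b) in spans.items()}
    return out, attn

# Hypothetical example: three streams with different token counts, shared width d.
rng = random.Random(0)
d = 4
streams = {
    "vision": random_matrix(3, d, rng),
    "touch": random_matrix(2, d, rng),
    "action": random_matrix(5, d, rng),
}
weights = {name: {p: random_matrix(d, d, rng) for p in "qkvo"} for name in streams}
out, attn = joint_self_attention(streams, weights)
```

Each stream keeps its own token count and parameters, while the attention matrix spans all 10 tokens, so an action token can draw on vision and touch tokens in the same layer. A practical model would add multiple heads, normalization, residual connections, and feed-forward blocks per stream.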

Multimodal Models, Robotics & Embodied AI, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
