Jun 6, 2026 · arXiv:2606.08242

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyu Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

AI Summary

This paper introduces Light-WAM, a lightweight World Action Model that integrates future prediction into robot manipulation policy learning at low computational cost. It uses a compact video backbone and applies future-video supervision in a downsampled latent space, which reduces the cost of video co-training relative to WAMs built on large-scale generative architectures. Light-WAM maintains strong performance on the LIBERO benchmark and achieves usable multi-task performance on RoboTwin 2.0 while using only 0.44B trainable parameters, with 72.03 ms inference latency and 4.1 GiB peak GPU memory.

Key Contribution

Light-WAM maintains strong LIBERO performance and usable multi-task performance on RoboTwin 2.0 with only 0.44B trainable parameters, by pairing a compact video backbone with a single-pass StateFusionActionExpert action decoder.

Abstract

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory and improved training throughput.
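
The abstract describes the action decoder only at a high level: adapt states from several backbone layers, fuse them with learned-query pooling, and emit an action chunk in one forward pass. The sketch below illustrates that data flow in NumPy. It is not the authors' implementation. The class name `StateFusionSketch`, the per-layer linear adapters, the single-head attention pooling, the linear output head, and every dimension are assumptions chosen for illustration, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical multi-layer state fusion with learned-query pooling."""

    def __init__(self, num_layers, d_backbone, d_model,
                 num_queries, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [rng.normal(0.0, scale, (d_backbone, d_model))
                         for _ in range(num_layers)]
        # Query vectors that pool the adapted tokens (random here, learned in practice).
        self.queries = rng.normal(0.0, scale, (num_queries, d_model))
        # Assumption: a linear head maps pooled features to the whole action chunk.
        self.head = rng.normal(0.0, scale,
                               (num_queries * d_model, chunk_len * action_dim))
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def __call__(self, layer_states):
        # layer_states: list of arrays, each (num_tokens, d_backbone).
        adapted = [h @ w for h, w in zip(layer_states, self.adapters)]
        tokens = np.concatenate(adapted, axis=0)        # (layers * tokens, d_model)
        scores = self.queries @ tokens.T / np.sqrt(tokens.shape[1])
        attn = softmax(scores, axis=-1)                 # each query attends over all tokens
        pooled = attn @ tokens                          # (num_queries, d_model)
        actions = pooled.reshape(-1) @ self.head        # single pass, no iterative sampling
        return actions.reshape(self.chunk_len, self.action_dim)


expert = StateFusionSketch(num_layers=3, d_backbone=64, d_model=32,
                           num_queries=4, chunk_len=8, action_dim=7)
states = [np.random.default_rng(i).normal(size=(16, 64)) for i in range(3)]
chunk = expert(states)
print(chunk.shape)  # (8, 7)
```

The point of the sketch is the cost profile: the decoder runs once per control step and returns all `chunk_len` actions together, in contrast to a generative action expert that would need several denoising or sampling iterations per chunk.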

Robotics & Embodied AI · Training Efficiency & Optimization · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
