HKU | Jun 10, 2026 | arXiv:2606.12217

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

AI Summary

This paper examines why World Action Models (WAMs) for robot manipulation can generate plausible visual futures yet still fail to extract accurate actions. Using action-head attention analysis and causal interventions, the authors show that the action decoder does not focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. They attribute this to a representation mismatch and introduce the Action-Grounded Representation Alignment (AGRA) objective, which aligns intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. On real-world manipulation tasks, this improves object localization, affordance understanding, and robustness to task-irrelevant perturbations.

Key Contribution

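Action-Grounded Representation Alignment (AGRA) regularizes the interface between the world model and the action decoder so that the decoder attends to the correct interaction regions, improving both in-distribution performance and out-of-distribution generalization over the baseline world action model.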

Abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a failure mode: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. We propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
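
The abstract does not spell out the exact form of the AGRA objective, so the following is only a minimal sketch of one common way to align intermediate features with a frozen encoder's features: a per-token negative cosine similarity after a linear projection. The function name `alignment_loss`, the projection matrix `proj`, and the array shapes are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def alignment_loss(hidden, target, proj):
    """Mean negative cosine similarity between projected hidden tokens and targets.

    Hypothetical shapes (assumptions for illustration only):
      hidden: (num_tokens, d_hidden)  intermediate video diffusion features
      target: (num_tokens, d_target)  features from a frozen visual encoder
      proj:   (d_hidden, d_target)    learnable linear projection
    """
    eps = 1e-8
    projected = hidden @ proj
    # Normalize each token so the dot product below is a cosine similarity.
    projected = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cosine = (projected * target).sum(axis=-1)
    # Lower is better: -1 means every token is perfectly aligned.
    return float(-cosine.mean())


# Toy usage: identical features with an identity projection are perfectly aligned.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
print(alignment_loss(feats, feats, np.eye(8)))
```

In a training loop, a term like this would typically be added to the action loss with a weighting coefficient; how AGRA weights or schedules its term is not stated in the abstract.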

Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
