The paper introduces CLaD, a framework for robotic manipulation that explicitly aligns kinematic and semantic transitions by modeling their joint evolution using asymmetric cross-attention. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, which prevents representation collapse and anchors predictions to observable states. Experiments on the LIBERO-LONG benchmark show CLaD achieves 94.7% success rate, rivaling large VLAs while using significantly fewer parameters.
Achieve state-of-the-art robotic manipulation with a model orders of magnitude smaller than VLAs by explicitly aligning kinematic and semantic transitions.
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention, which allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, preventing representation collapse while anchoring predictions to observable states. Predicted foresights are modulated with observations to condition a diffusion policy for action generation. On the LIBERO-LONG benchmark, CLaD achieves a 94.7% success rate, competitive with large VLAs while using significantly fewer parameters.
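The abstract names two concrete mechanisms: asymmetric cross-attention in which kinematic (proprioceptive) tokens query semantic tokens but not the reverse, and an EMA target encoder for the self-supervised foresight objective. The minimal numpy sketch below illustrates both in generic form; all function names, shapes, and the momentum value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(kin_tokens, sem_tokens, wq, wk, wv):
    """One direction only: kinematic tokens act as queries,
    semantic tokens supply keys and values (the reverse
    direction is deliberately absent)."""
    q = kin_tokens @ wq                      # (n_kin, d)
    k = sem_tokens @ wk                      # (n_sem, d)
    v = sem_tokens @ wv                      # (n_sem, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_kin, n_sem)
    return softmax(scores, axis=-1) @ v      # (n_kin, d)

def ema_update(target_params, online_params, tau=0.99):
    """BYOL-style EMA target encoder update: the target slowly
    tracks the online encoder, which helps prevent collapse."""
    return {name: tau * target_params[name] + (1 - tau) * online_params[name]
            for name in target_params}

# Illustrative usage with toy dimensions.
rng = np.random.default_rng(0)
d = 8
kin = rng.normal(size=(3, d))   # 3 kinematic tokens
sem = rng.normal(size=(5, d))   # 5 semantic tokens
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = asymmetric_cross_attention(kin, sem, wq, wk, wv)  # shape (3, 8)
```

Note the asymmetry is purely structural: only the kinematic stream is enriched with semantic context, so gradients flow from the fused output into both streams without requiring a symmetric attention pass.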