May 26, 2026 · arXiv:2605.26649

On the Generalization Capabilities, Design Choices and Limitations of Keypoint Imitation Learning

Thomas Lips, Marco Moletta, Michael C. Welle, Danica Kragic, Francis Wyffels

AI Summary

This paper investigates Keypoint Imitation Learning (KIL) for robotic manipulation, focusing on its design choices and generalization capabilities relative to an RGB-based baseline and S2-diffusion. The authors systematically evaluate KIL across five real-world tasks using over 2000 rollouts and examine the impact of several design choices. KIL significantly outperforms the RGB baseline (75% vs. 47% overall success rate) and performs on par with S2-diffusion (73%). The paper also highlights limitations inherited from the foundation models used for keypoint extraction.

Key Contribution

Keypoint Imitation Learning clearly outperforms the RGB baseline in robotic manipulation (75% vs. 47% overall success rate), but it only matches S2-diffusion (73%) and does not surpass alternative representations.

Abstract

RGB-based imitation learning requires many demonstrations to generalize to unseen objects or scenes, motivating research into intermediate representations to improve generalization for robotic manipulation. Visual foundation models enable one-shot extraction of keypoints to provide such a representation. However, it remains unclear how to integrate them into imitation learning optimally and when they outperform alternative representations. We combine approaches from previous works on keypoint imitation learning (KIL) and investigate several design choices to provide practical guidelines. Using over 2000 real-world rollouts, we also assess the generalization capabilities of KIL to unseen objects and scene variations. KIL achieves a 75% overall success rate across five tasks, significantly outperforming the RGB baseline (47%) and performing on par with S2-diffusion (73%). Finally, we explore the limitations of the foundation models used for keypoint extraction and extend KIL to tasks with multiple object instances. Our results confirm KIL as a data-efficient approach for robot learning, though it does not outperform alternative representations and inherits limitations of the foundation models used for keypoint extraction. All rollout videos, demonstrations, and results are available at https://kil-manipulation.github.io/.

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
