Mar 5, 2026 · arXiv:2603.05017

Direct Contact-Tolerant Motion Planning With Vision Language Models

He Li, Jian Sun, Chengyang Li, Guoliang Li, Qiyu Ruan, Shuai Wang, Chengzhong Xu

AI Summary

This paper introduces a direct contact-tolerant (DCT) motion planner that integrates vision-language models (VLMs) for robot navigation in cluttered environments. The approach uses a VLM point cloud partitioner (VPP) to reason about contact tolerance in image space and generate contact-aware point clouds. A VPP-guided navigation (VGN) module then formulates contact-tolerant motion planning as a perception-to-control optimization problem solved by a DNN. Experiments in simulation and on a real robot demonstrate DCT's robustness and efficiency compared to baselines.

Key Contribution

The DCT planner lets a robot navigate cluttered spaces more efficiently by using a vision-language model to reason in image space about which objects it may touch, so that contact with movable objects is tolerated directly rather than inferred from a prebuilt map or obstacle set.

Abstract

Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation and has two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints; this problem is then solved by a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
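
To make the mask-projection step of VPP more concrete, the following is a minimal sketch of how a cached image-space mask could split a point cloud into contact-tolerant and non-tolerant points. It is not taken from the DCT code base: the function name `partition_by_mask`, the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the conservative rule that unseen points count as non-tolerant are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# A point in the camera frame: x right, y down, z forward (assumed convention).
Point = Tuple[float, float, float]


def partition_by_mask(
    points: Sequence[Point],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[List[Point], List[Point]]:
    """Split points into (contact_tolerant, non_tolerant) lists.

    mask[v][u] is True where the (hypothetical) VLM marked the pixel as
    belonging to an object the robot may touch.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    tolerant: List[Point] = []
    non_tolerant: List[Point] = []

    for x, y, z in points:
        # Points behind the camera cannot be checked against the mask,
        # so they are conservatively treated as non-tolerant.
        if z <= 0.0:
            non_tolerant.append((x, y, z))
            continue

        # Pinhole projection to pixel coordinates.
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)

        inside = 0 <= u < width and 0 <= v < height
        if inside and mask[v][u]:
            tolerant.append((x, y, z))
        else:
            # Outside the image or not marked by the mask: keep as a hard obstacle.
            non_tolerant.append((x, y, z))

    return tolerant, non_tolerant


if __name__ == "__main__":
    # 4x4 toy mask: the left half of the image is marked contact-tolerant.
    toy_mask = [[True, True, False, False] for _ in range(4)]
    toy_points = [(-0.5, 0.0, 1.0), (0.5, 0.0, 1.0), (0.0, 0.0, -1.0)]
    soft, hard = partition_by_mask(toy_points, toy_mask, 2.0, 2.0, 2.0, 2.0)
    print("tolerant:", soft)
    print("non-tolerant:", hard)
```

In this sketch, treating unobserved points as hard obstacles is a deliberately cautious design choice; the abstract does not say how DCT handles points that fall outside the cached mask.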

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0
Influential citations: 0
References: 23
Year: 2026
Venue: N/A
