ENSTA, Institut Polytechnique de Paris · Apr 21, 2026 · arXiv:2604.19267

Multimodal embodiment-aware navigation transformer

Louis Dezons, Quentin Picard, Rémi Marsal, François Goulette, David Filliat

AI Summary

ViLiNT, a multimodal navigation policy, is introduced to improve the robustness of goal-conditioned navigation models under distribution shift by fusing RGB images, LiDAR, goal embeddings, and robot embodiment descriptors within a transformer architecture. This transformer conditions a diffusion model to generate trajectories, which are then ranked using a path clearance prediction head trained on automatically generated offline labels. Results across simulated and real-world environments demonstrate a 166% improvement in Success Rate compared to a vision-only baseline, highlighting the benefits of multimodal fusion and collision prediction.
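To make the headline number concrete, here is a small sketch of what a 166% improvement means arithmetically, assuming the figure is a relative gain over the baseline Success Rate (it cannot be a difference in percentage points, since a rate cannot exceed 100%). The function name and the baseline value in the usage note are hypothetical and are not taken from the paper.

```python
def improved_rate(baseline_rate: float, relative_gain_pct: float) -> float:
    """Apply a relative gain (in percent) to a baseline rate.

    A gain of 166% multiplies the baseline by 1 + 1.66 = 2.66.
    """
    return baseline_rate * (1.0 + relative_gain_pct / 100.0)


def relative_gain_pct(baseline_rate: float, new_rate: float) -> float:
    """Inverse operation: relative gain in percent between two rates."""
    return (new_rate - baseline_rate) / baseline_rate * 100.0
```

For example, with a purely illustrative baseline Success Rate of 0.25, `improved_rate(0.25, 166)` gives a rate of about 0.665, that is, 2.66 times the baseline.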

Key Contribution

Robots can navigate more robustly in the real world by learning to predict path clearance from multimodal data and a robot's own embodiment, even when conditions change.

Abstract

Goal-conditioned navigation models for ground robots trained with supervised learning show promising zero-shot transfer, but their collision-avoidance capability degrades under distribution shift, i.e., changes in environment, robot, or sensor configuration. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness through two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output conditions a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head that scores and ranks the trajectories produced by the diffusion model. Both the diffusion conditioning and the trajectory ranking head depend on a robot embodiment token, which allows the model to generate and select trajectories with respect to the robot's dimensions. Across three simulated environments, ViLiNT improves Success Rate by 166% on average over an equivalent state-of-the-art vision-only baseline (NoMaD). This performance gain is confirmed in real-world deployments of a rover navigating obstacle fields. These results highlight that combining multimodal fusion with our collision prediction mechanism improves the robustness of off-road navigation.
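The generate-then-rank idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: ViLiNT scores trajectories with a learned path clearance head, whereas this sketch substitutes a hand-written geometric clearance (the smallest waypoint-to-obstacle distance minus a robot radius) as a stand-in. The function names, the 2D point format, and all data values are hypothetical.

```python
import math


def clearance(trajectory, obstacles, robot_radius):
    """Geometric stand-in for a clearance score (assumption, not the learned head).

    Returns the smallest distance between any waypoint and any obstacle
    point, minus the robot radius. A negative value means the robot body
    would overlap an obstacle somewhere along the path.
    """
    nearest = min(math.dist(p, o) for p in trajectory for o in obstacles)
    return nearest - robot_radius


def rank_trajectories(candidates, obstacles, robot_radius):
    """Sort candidate trajectories from most to least clearance."""
    return sorted(
        candidates,
        key=lambda t: clearance(t, obstacles, robot_radius),
        reverse=True,
    )


# Hypothetical obstacle points and candidate trajectories (x, y in metres).
obstacles = [(1.0, 0.0), (2.0, 1.0)]
straight = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.4)]
left = [(0.0, 1.5), (1.0, 1.5), (2.0, 2.5)]
right = [(0.0, -1.0), (1.0, -1.0), (2.0, -1.0)]
candidates = [straight, left, right]

# The same candidates are scored differently for a small and a large robot.
ranked_small = rank_trajectories(candidates, obstacles, robot_radius=0.3)
feasible_large = [
    t for t in candidates if clearance(t, obstacles, robot_radius=1.05) >= 0.0
]
```

In this toy setup the ranking depends on the robot radius, which mirrors the role the abstract gives to the embodiment token: a path that is acceptable for a small platform can be rejected for a wider one.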

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
