BIT; Centre for Artificial Intelligence and Robotics; Institute of Artificial Intelligence and Robotics; National Engineering Research Center for Visual; National Key Laboratory of Human-Machine; The State Key Laboratory of Internet of Things for Smart City; UMacau; XJTU

Apr 19, 2026

arXiv:2604.17407

Think before Go: Hierarchical Reasoning for Image-goal Navigation

Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Lin Zhao, Long Chen, Zhi-Xin Yang, Nanning Zheng

AI Summary

The paper introduces Hierarchical Reasoning Navigation (HRNav), a framework that decomposes image-goal navigation into high-level planning using a vision-language model and low-level execution using online reinforcement learning. HRNav trains a vision-language model on a self-collected dataset to generate short-horizon plans, which are then used to condition an RL policy for action selection. Experiments in simulation and real-world environments demonstrate that HRNav outperforms existing end-to-end navigation policies, especially in long-horizon tasks.

Key Contribution

Decomposing image-goal navigation into high-level planning with a vision-language model and low-level execution with online RL reduces wandering and improves success, particularly on long-horizon tasks.

Abstract

Image-goal navigation steers an agent to a target location specified by an image in unseen environments. Existing methods primarily handle this task by learning an end-to-end navigation policy, which compares the target and observation images and directly predicts actions. However, when the target is distant or lies in another room, such methods fail to extract informative visual cues, leaving the agent to wander. Motivated by the human cognitive principle that deliberate, high-level reasoning guides fast, reactive execution in complex tasks, we propose Hierarchical Reasoning Navigation (HRNav), a framework that decomposes image-goal navigation into high-level planning and low-level execution. In high-level planning, a vision-language model is trained on a self-collected dataset to generate a short-horizon plan, such as whether the agent should walk through the door or down the hallway. This reduces the difficulty of the long-horizon task, making it more tractable for low-level execution. In low-level execution, an online reinforcement learning policy selects actions conditioned on the short-horizon plan. We also devise a novel Wandering Suppression Penalty (WSP) to further reduce wandering. Together, these components form a hierarchical framework for image-goal navigation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method.
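The two-level control flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the grid positions standing in for images, the names `run_episode`, `toy_planner`, `toy_policy`, and `replan_every`, and in particular `wandering_penalty`, which is a simple revisit-count stand-in. The abstract does not define how WSP is computed, so this is not the paper's formulation.

```python
from collections import Counter
from typing import Callable, List, Tuple

Position = Tuple[int, int]  # toy stand-in for an image observation


def wandering_penalty(visits: Counter, cell: Position, scale: float = 0.1) -> float:
    """Assumed stand-in for WSP: the penalty grows with prior visits to a cell."""
    return -scale * visits[cell]


def run_episode(
    planner: Callable[[Position, Position], str],
    policy: Callable[[Position, str], Position],
    start: Position,
    goal: Position,
    replan_every: int = 3,  # hypothetical plan horizon, in steps
    max_steps: int = 50,
) -> Tuple[bool, float, List[str]]:
    """Alternate slow high-level planning with fast low-level action selection."""
    pos, plan = start, ""
    visits: Counter = Counter()
    total_penalty = 0.0
    plans: List[str] = []
    for step in range(max_steps):
        if pos == goal:
            return True, total_penalty, plans
        if step % replan_every == 0:
            # High level: ask the planner for a new short-horizon plan.
            plan = planner(pos, goal)
            plans.append(plan)
        # Penalize revisiting a cell, then record the visit.
        total_penalty += wandering_penalty(visits, pos)
        visits[pos] += 1
        # Low level: the policy picks the next state conditioned on the plan.
        pos = policy(pos, plan)
    return pos == goal, total_penalty, plans


def toy_planner(pos: Position, goal: Position) -> str:
    """Stand-in for the vision-language model: returns a text plan."""
    return "go down the hallway" if pos[0] != goal[0] else "go through the door"


def toy_policy(pos: Position, plan: str) -> Position:
    """Stand-in for the RL policy: the plan selects the axis of motion."""
    x, y = pos
    return (x + 1, y) if plan == "go down the hallway" else (x, y + 1)


success, penalty, plans = run_episode(toy_planner, toy_policy, (0, 0), (3, 2))
print(success, plans)
```

In this toy setting the planner is queried only every `replan_every` steps, while the policy acts at every step; a policy that oscillates between the same cells accumulates a negative penalty, which is the kind of behavior a wandering penalty is meant to discourage.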

Computer Vision, Multimodal Models, Robotics & Embodied AI
