This paper introduces ViTeC, a Visual and Textual Commonsense-Enhanced Layout Learning Model, to improve Vision-and-Language Navigation (VLN) by addressing the lack of commonsense knowledge in navigation agents. ViTeC leverages ChatGPT and BLIP-2 to provide textual commonsense about room layouts and employs Stable Diffusion to generate commonsense-based visual images, enhancing the agent's understanding of the environment. Experiments on REVERIE, R2R, and SOON datasets demonstrate that ViTeC achieves strong performance and generalization ability in complex environments, validating its effectiveness in enhancing environmental understanding and navigation capabilities.
VLN agents can navigate more effectively by learning commonsense relationships between rooms and landmarks, thanks to a new method that injects knowledge from ChatGPT, BLIP-2, and Stable Diffusion.
In the Vision-and-Language Navigation (VLN) task, an agent must comprehend natural language instructions and execute precise navigation in complex environments. While significant progress has been made in the VLN field, the limited availability of navigation data hinders existing methods from fully learning the commonsense relationships between rooms and landmarks, which are crucial for environmental understanding and successful navigation. To address this issue, this work proposes a Visual and Textual Commonsense-Enhanced Layout Learning Model (ViTeC). We leverage the open-world knowledge embedded in large models by utilizing ChatGPT and BLIP-2 to provide commonsense information about environments. Specifically, BLIP-2 identifies the room type corresponding to each panoramic image, while ChatGPT infers and provides knowledge about the most common landmarks within each room type. Moreover, to compensate for the agent's lack of commonsense at the visual level, we employ Stable Diffusion to generate commonsense-based visual images, enhancing the agent's visual perception. To ensure the agent effectively learns commonsense about the environment, we design a Text Commonsense Layout Learning Module and a Visual Commonsense Layout Learning Module. These modules help the agent acquire environmental commonsense from both linguistic and visual perspectives, enabling it to utilize commonsense information effectively during navigation and thereby improving its environmental understanding and reasoning capabilities. Experimental results demonstrate that ViTeC achieves strong performance on the REVERIE, R2R, and SOON datasets and exhibits good generalization ability in complex environments, validating the effectiveness of ViTeC in enhancing the agent's environmental understanding and navigation capabilities.
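To make the textual-commonsense pipeline concrete, the sketch below traces the flow the abstract describes: a room-type classifier (standing in for BLIP-2) labels each panorama, and a landmark-inference step (standing in for a ChatGPT query) supplies the landmarks commonly found in that room type. All function names, panorama IDs, and the knowledge table are illustrative assumptions for exposition, not details taken from the paper.

```python
from typing import Dict, List

# Illustrative commonsense table: room type -> typical landmarks.
# In ViTeC this knowledge would be elicited from ChatGPT rather than hard-coded.
ROOM_LANDMARKS: Dict[str, List[str]] = {
    "kitchen": ["refrigerator", "stove", "sink"],
    "bathroom": ["toilet", "shower", "mirror"],
    "bedroom": ["bed", "wardrobe", "nightstand"],
}


def classify_room(panorama_id: str) -> str:
    """Stand-in for BLIP-2 room-type recognition on a panoramic image.

    A real system would run the vision-language model on pixels; here we
    fake the prediction with a lookup for illustration.
    """
    fake_predictions = {"pano_001": "kitchen", "pano_002": "bedroom"}
    return fake_predictions.get(panorama_id, "unknown")


def landmark_prompt(room_type: str) -> str:
    """Build the kind of query one might send to ChatGPT for landmark commonsense."""
    return f"List the objects most commonly found in a {room_type}."


def commonsense_for_panorama(panorama_id: str) -> List[str]:
    """Panorama -> room type -> expected landmarks (the textual commonsense)."""
    room = classify_room(panorama_id)
    return ROOM_LANDMARKS.get(room, [])
```

In the full model, the landmarks returned here would feed the Text Commonsense Layout Learning Module, while Stable Diffusion would render commonsense-conditioned images for the visual counterpart.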
Note to Practitioners—In real-world applications, service robots often lack commonsense knowledge of environmental layouts, leading to inefficient navigation in complex scenarios, particularly in smart homes, medical assistance, and retail guidance. To address this issue, our research integrates knowledge from large models and visual generation techniques, enabling navigation agents to leverage commonsense information about room layouts and the distribution of common objects to enhance path planning. The advantage of this approach lies in improving the agent’s understanding of the environment, thereby reducing unnecessary exploration and increasing navigation accuracy. However, its effectiveness depends on the comprehensiveness of the knowledge base and the agent’s visual reasoning capabilities, which may pose adaptability challenges in highly dynamic or non-standard environments. In the future, this method can be further enhanced through online knowledge updates, adaptive learning, and multimodal data fusion, equipping robots with stronger navigation capabilities in a wider range of real-world scenarios.