Guangdong Laboratory of Artificial | HIT | Apr 19, 2026 | arXiv:2604.17241

GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

Kun Wang, Yiming Li, Mingcheng Qu, Aqiang Zhang, Guang Yang, Tonghua Su

AI Summary

This paper introduces GaLa, a vision-language framework that uses a hypergraph-based representation to improve procedural planning in embodied AI systems by capturing implicit spatial relations and deep semantic structures from multimodal inputs. GaLa models object instances as nodes and constructs hyperedges from object attributes and functional semantics, which gives the model a structured view of complex scenes that existing vision-language models lack. Experimental results on the ActPlan1K and ALFRED benchmarks show that GaLa significantly outperforms current methods in execution success rate, LCS, and planning correctness.
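The node and hyperedge construction described above can be illustrated with a minimal sketch. It assumes each detected object already carries a set of attribute or function tags, and that a hyperedge is formed for every tag shared by at least two objects. The function name `build_hypergraph`, the tags, and the toy scene are hypothetical and are not taken from the paper, which may group objects differently.

```python
from collections import defaultdict


def build_hypergraph(objects):
    """Build a node-by-hyperedge incidence matrix.

    objects: list of (name, tags) pairs, where tags is a set of
    attribute/function labels (an assumed input format).
    Returns (edge_labels, incidence), with incidence[i][j] == 1
    when object i belongs to hyperedge j.
    """
    groups = defaultdict(list)
    for idx, (_name, tags) in enumerate(objects):
        for tag in tags:
            groups[tag].append(idx)

    # Assumption: a tag becomes a hyperedge only if it links 2+ objects.
    edge_labels = sorted(t for t, members in groups.items() if len(members) >= 2)
    incidence = [
        [1 if i in groups[label] else 0 for label in edge_labels]
        for i in range(len(objects))
    ]
    return edge_labels, incidence


# Toy kitchen scene with hypothetical tags.
SCENE = [
    ("mug", {"container", "drinkware"}),
    ("kettle", {"container", "heat_source"}),
    ("stove", {"heat_source"}),
    ("sponge", {"cleaning"}),
]

EDGE_LABELS, INCIDENCE = build_hypergraph(SCENE)
```

Unlike an ordinary graph edge, one hyperedge here can connect any number of objects, so a functional region such as "everything that holds liquid" is a single column of the incidence matrix.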

Key Contribution

GaLa's hypergraph representation makes implicit semantic relations among objects and functional regions explicit, which the authors report significantly improves procedural planning on the ActPlan1K and ALFRED benchmarks.

Abstract

Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, the area view, and the node-area association view via contrastive learning, enabling hypergraph semantics to be injected more effectively into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
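The abstract does not specify the contrastive objective used by the TriView HyperGraph Encoder. As a rough illustration of how cross-view consistency can be enforced, the sketch below applies a symmetric InfoNCE loss to each pair of views. It assumes that row i of every view is an embedding of the same scene, so matching rows are positives and all other rows are negatives. The function names, the temperature value, and the pairing scheme are assumptions, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss: row i of view_a should match row i of view_b."""
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of a_i to every row of view_b, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def tri_view_loss(node_view, area_view, assoc_view, temperature=0.1):
    """Average symmetric InfoNCE over the three view pairs (assumed scheme)."""
    pairs = [
        (node_view, area_view),
        (node_view, assoc_view),
        (area_view, assoc_view),
    ]
    loss = sum(
        info_nce(p, q, temperature) + info_nce(q, p, temperature)
        for p, q in pairs
    )
    return loss / (2 * len(pairs))


# Toy check: aligned views should score lower than a shuffled area view.
ALIGNED = [[1.0, 0.0], [0.0, 1.0]]
SHUFFLED = [[0.0, 1.0], [1.0, 0.0]]
LOSS_ALIGNED = tri_view_loss(ALIGNED, ALIGNED, ALIGNED)
LOSS_SHUFFLED = tri_view_loss(ALIGNED, SHUFFLED, ALIGNED)
```

In this toy setup the loss is near zero when all three views agree and grows when one view is permuted, which is the behavior a consistency objective of this kind is meant to reward.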

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
