Search papers, labs, and topics across Lattice.
This paper introduces the Multi-View Hierarchical Graph Neural Network (MV-HGNN) for sketch-based 3D shape retrieval, addressing the limitations of existing methods that inadequately capture geometric relationships between multi-view features. By constructing a view-level graph and employing local graph convolution alongside global attention, the framework enhances the discriminative power of 3D representations while enabling category-agnostic alignment through CLIP text embeddings. Extensive experiments reveal that MV-HGNN significantly outperforms state-of-the-art approaches in both category-level and zero-shot retrieval scenarios.
MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.
Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that are consistent with the category of the input hand-drawn sketch. The core challenge of this task lies in two aspects: existing methods typically employ simplified aggregation strategies for independently encoded 3D multi-view features, which ignore the geometric relationships between views and multi-level details, resulting in weak 3D representation. Simultaneously, traditional SBSR methods are constrained by visible category limitations, leading to poor performance in zero-shot scenarios. To address these challenges, we propose Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph and capture adjacent geometric dependencies and cross-view message passing via local graph convolution and global attention. A view selector is further introduced to perform hierarchical graph coarsening, enabling a progressively larger receptive field for graph convolution and mitigating the interference of redundant views, which leads to more discriminate discriminative hierarchical 3D representation. To enable category agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. We use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same model architecture. Under both category-level and zero-shot settings, extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods.