The paper introduces the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel LVLM architecture designed to improve robustness and reduce hallucination in fine-grained visual-language understanding tasks. HCG-LVLM uses a two-layered approach combining global contextual perception with fine-grained local grounding, incorporating a local detail enhancement module and a semantic consistency validator. Experiments on the GQA, A-OKVQA, and RefCOCO/+/g datasets show that HCG-LVLM outperforms state-of-the-art models, achieving higher accuracy with reduced hallucination.
LVLMs can achieve state-of-the-art performance in fine-grained visual reasoning tasks by mimicking human coarse-to-fine cognitive processing with a hierarchical architecture.
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have achieved remarkable progress in natural language processing and multimodal understanding. Despite their impressive generalization capabilities, current LVLMs often exhibit insufficient robustness, susceptibility to hallucination, and reasoning errors in complex real-world scenarios, particularly when precise image region localization and fine-grained visual reasoning are required. To address these limitations, we propose the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel architecture that mimics human coarse-to-fine cognitive processing. HCG-LVLM employs a two-layered approach: a Global Contextual Perception layer for initial broad understanding and a Fine-grained Local Grounding layer. The latter incorporates a Local Detail Enhancement Module to extract high-resolution features and a Semantic Consistency Validator to ensure accurate, hallucination-free visual-language alignment. Through an adaptive fusion mechanism, information from both layers is integrated for robust and precise outputs. Extensive experiments on challenging datasets, including GQA and A-OKVQA for fine-grained VQA and RefCOCO/+/g for Referring Expression Comprehension, demonstrate that HCG-LVLM consistently outperforms state-of-the-art models such as Flamingo, BLIP-2, and MiniGPT-4. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design in enhancing fine-grained visual-language understanding and precise grounding capabilities.
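The coarse-to-fine pipeline described above can be sketched in miniature. The snippet below is a toy illustration only, not the paper's implementation: a coarse pass pools all image-patch features into a global vector, a fine pass attends to the patches most relevant to a query, and a gated blend stands in for the adaptive fusion mechanism. All function names, dimensions, and the scalar gate parameterization are illustrative assumptions.

```python
import numpy as np

def global_context(image_feats):
    # Coarse pass: mean-pool all patch features into one global vector
    # (a stand-in for the Global Contextual Perception layer).
    return image_feats.mean(axis=0)

def local_grounding(image_feats, query):
    # Fine pass: softmax-attend to patches most similar to the query
    # (a stand-in for the Fine-grained Local Grounding layer).
    scores = image_feats @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ image_feats

def adaptive_fusion(g, l, w):
    # Hypothetical adaptive fusion: a learned scalar gate blends the
    # global (g) and local (l) representations.
    gate = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([g, l]))))
    return gate * g + (1.0 - gate) * l

rng = np.random.default_rng(0)
d = 8
patches = rng.standard_normal((16, d))  # 16 image-patch features
query = rng.standard_normal(d)          # encoded referring expression
w = rng.standard_normal(2 * d)          # hypothetical fusion weights

g = global_context(patches)
l = local_grounding(patches, query)
fused = adaptive_fusion(g, l, w)
```

Because the gate is a scalar in [0, 1], the fused vector lies elementwise between the global and local representations; in the full model the fusion (and both layers) would be learned end-to-end rather than fixed as here.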