This paper introduces a VQA architecture that combines structured visual relationship modeling with adaptive attention mechanisms for hierarchical cross-modal feature fusion. The approach uses sparse graph computations to capture multi-order dependencies among visual entities, and a dynamically parameterized bilinear attention mechanism for language-conditioned feature recalibration. Experiments show that the framework enhances the semantic specificity of visual representations and improves answer-inference accuracy.
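To make the relational reasoning step concrete, the sketch below illustrates one plausible reading of "sparse graph computations capturing multi-order dependencies": region features are linked by a top-k sparsified affinity graph and repeatedly propagated over it, each propagation step exposing a higher-order (multi-hop) dependency. The class name, `top_k`, and `num_orders` are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Hypothetical sketch of sparse relational reasoning over region features.
# Names (SparseRelationEncoder, top_k, num_orders) are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseRelationEncoder(nn.Module):
    """Propagates region features over a sparsified affinity graph.

    Stacking propagation steps over the same sparse graph exposes
    multi-order (multi-hop) dependencies among visual entities.
    """
    def __init__(self, dim: int, top_k: int = 8, num_orders: int = 2):
        super().__init__()
        self.top_k = top_k
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_orders)])

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim) object/region features
        affinity = regions @ regions.transpose(1, 2)      # pairwise affinities
        topk = affinity.topk(self.top_k, dim=-1)
        mask = torch.full_like(affinity, float("-inf"))
        mask.scatter_(-1, topk.indices, topk.values)      # keep only top-k edges per node
        adj = F.softmax(mask, dim=-1)                     # row-normalized sparse graph

        out, hidden = regions, regions
        for proj in self.proj:                            # k-th step ~ k-th order hops
            hidden = F.relu(proj(adj @ hidden))
            out = out + hidden                            # aggregate multi-order context
        return out

# Example usage with 36 detected regions of 2048-d features:
# encoder = SparseRelationEncoder(dim=2048)
# fused = encoder(torch.randn(4, 36, 2048))   # (4, 36, 2048)
```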
Achieve precise answer inference in VQA by fusing visual relationships with language-conditioned attention, all within a lightweight architecture.
Visual Question Answering (VQA), an interdisciplinary field bridging computer vision and natural language processing, has garnered substantial attention owing to the challenges it poses for multimodal reasoning. Central to advancing VQA systems is the accurate alignment of cross-modal representations derived from visual inputs and linguistic queries. We demonstrate that integrating structured visual relationship modeling with adaptive attention mechanisms enables hierarchical fusion of cross-modal features. To this end, we propose a lightweight architecture that captures multi-order dependencies among visual entities through sparse graph computations. We further develop a dynamically parameterized bilinear attention mechanism that performs language-conditioned feature recalibration, enhancing the semantic specificity of visual representations. Our framework processes visual and textual inputs through parallel relational-reasoning and attention pathways, integrating cross-modal features for precise answer inference.
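The second component, language-conditioned recalibration via bilinear attention, can be sketched as follows under the simplifying assumption that the question is summarized as a single vector: the question generates per-channel gates that rescale visual features (the "dynamic parameterization"), and a bilinear form then scores each region against the question. The class and parameter names are hypothetical and chosen only for illustration.

```python
# Minimal sketch of language-conditioned bilinear attention with dynamic
# channel recalibration; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedBilinearAttention(nn.Module):
    def __init__(self, v_dim: int, q_dim: int, hid_dim: int = 512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        # Dynamic parameterization: the question produces per-channel gates
        # that recalibrate visual features before bilinear scoring.
        self.gate = nn.Linear(q_dim, hid_dim)
        self.bilinear = nn.Bilinear(hid_dim, hid_dim, 1)

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, v_dim); question: (batch, q_dim)
        v = self.v_proj(regions)                            # (B, N, H)
        q = self.q_proj(question)                           # (B, H)
        gates = torch.sigmoid(self.gate(question))          # language-conditioned gates
        v = v * gates.unsqueeze(1)                          # recalibrate visual channels
        q_exp = q.unsqueeze(1).expand_as(v)                 # broadcast question per region
        scores = self.bilinear(v, q_exp).squeeze(-1)        # bilinear attention logits
        attn = F.softmax(scores, dim=-1)                    # (B, N) attention over regions
        return torch.bmm(attn.unsqueeze(1), v).squeeze(1)   # attended visual summary (B, H)
```

In this reading, the gating plays the recalibration role while the bilinear form supplies question-specific region scores; the attended summary would then feed the answer classifier alongside the relational features from the graph pathway.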