This paper addresses the limitations of sequence-to-sequence networks in task decomposition for visual question answering (VQA) when dealing with flexible natural language. The authors propose a Graph-to-Sequence Task Decomposition Network (Graph2Seq-TDN) that leverages semantic structural information from natural language to guide task decomposition and improve parsing accuracy. Experimental results on the CLEVR, CLEVR-Human, CLEVR-CoGenT, and GQA datasets demonstrate that Graph2Seq-TDN achieves superior answering accuracy, program accuracy, and training efficiency compared to existing models.
By encoding semantic structure into the task decomposition process, Graph2Seq-TDN achieves state-of-the-art VQA performance while improving interpretability and reducing dependence on dataset biases.
Visual question answering (VQA) is an interdisciplinary task spanning computer vision and natural language processing that evaluates a model's visual reasoning ability, requiring the integration of image information extraction and natural language understanding. Testing on professional benchmarks that control for potential bias indicates that task-decomposition-based VQA is a promising approach: compared with traditional VQA methods that rely only on multimodal fusion, it offers better interpretability at the program-execution stage and reduced dependence on data biases. Task-decomposition VQA decomposes the task by parsing natural language, usually with sequence-to-sequence networks; these struggle with flexible and varied natural language, making it difficult to decompose the task accurately. To address this issue, we propose a Graph-to-Sequence Task Decomposition Network (Graph2Seq-TDN), which uses semantic structural information from natural language to guide the task decomposition process and improve parsing accuracy. In addition, for reasoning execution, beyond the original symbolic execution we propose a reasoning executor to enhance execution performance. We validated our approach on four datasets: CLEVR, CLEVR-Human, CLEVR-CoGenT, and GQA. The experimental results show that our model outperforms comparative models in answering accuracy and program accuracy, and incurs lower training cost at the same accuracy.
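To make the task-decomposition pipeline concrete, the following is a minimal sketch of the symbolic-execution stage: a question is first decomposed into a sequence of program operations (in Graph2Seq-TDN, by a graph-to-sequence parser over the question's semantic structure), and those operations are then executed step by step over a structured scene. The operation names, scene schema, and `execute` function here are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of symbolic program execution in task-decomposition VQA.
# A CLEVR-style scene is a list of attribute dictionaries; a decomposed program
# is a sequence of symbolic ops applied left to right.

SCENE = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

def execute(program, scene):
    """Run a decomposed program (a sequence of symbolic ops) on a scene."""
    objects = list(scene)
    for op in program:
        if op.startswith("filter_color"):
            # e.g. "filter_color[red]" keeps only red objects
            color = op.split("[")[1].rstrip("]")
            objects = [o for o in objects if o["color"] == color]
        elif op.startswith("filter_shape"):
            shape = op.split("[")[1].rstrip("]")
            objects = [o for o in objects if o["shape"] == shape]
        elif op == "count":
            return len(objects)
        else:
            raise ValueError(f"unknown op: {op}")
    return objects

# "How many red cubes are there?" decomposed into a program:
program = ["filter_color[red]", "filter_shape[cube]", "count"]
print(execute(program, SCENE))  # → 1
```

The parser's job is exactly to produce the `program` list; because each executed step is an explicit symbolic operation, intermediate results can be inspected, which is the interpretability advantage the abstract attributes to task-decomposition methods.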