POSTECH | Mar 8, 2026 | arXiv:2603.07394

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

AI Summary

The paper introduces AQuA, a new Visual Question Answering (VQA) dataset designed to evaluate and improve the ability of Vision-Language Models (VLMs) to handle ambiguous visual questions. It categorizes ambiguity into four levels and defines an optimal response strategy for each. The authors find that existing VLMs often fail to adapt their responses to the type of ambiguity and tend to give overconfident answers instead of seeking clarification. Fine-tuning VLMs on AQuA enables them to select an appropriate response strategy (answering directly, inferring intent, listing alternatives, or requesting clarification), which improves performance on ambiguous VQA instances.

Key Contribution

VLMs often blunder when faced with ambiguous visual questions, but a new dataset and fine-tuning approach can teach them to strategically seek clarification or list possibilities instead of confidently guessing wrong.

Abstract

Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
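The strategy-selection idea in the abstract can be illustrated with a minimal Python sketch. The four strategies are the ones the abstract names. Everything else is an assumption made for illustration and does not come from the paper: the 0-3 level numbering, the one-to-one pairing of levels with strategies in the order listed, and the `strategy_accuracy` helper.

```python
from enum import Enum


class Strategy(Enum):
    """The four response strategies named in the abstract."""
    DIRECT_ANSWER = "answer directly"
    INFER_INTENT = "infer intent from contextual cues"
    LIST_ALTERNATIVES = "list plausible alternatives"
    REQUEST_CLARIFICATION = "request clarification"


# ASSUMPTION: levels are numbered 0-3 and pair one-to-one with the
# strategies in the order the abstract lists them. The paper's actual
# level definitions may differ.
LEVEL_TO_STRATEGY = {
    0: Strategy.DIRECT_ANSWER,
    1: Strategy.INFER_INTENT,
    2: Strategy.LIST_ALTERNATIVES,
    3: Strategy.REQUEST_CLARIFICATION,
}


def strategy_accuracy(predicted, gold_levels):
    """Hypothetical metric: fraction of examples whose predicted strategy
    matches the strategy paired with the gold ambiguity level."""
    if len(predicted) != len(gold_levels):
        raise ValueError("predicted and gold_levels must have equal length")
    if not predicted:
        return 0.0
    hits = sum(
        pred is LEVEL_TO_STRATEGY[level]
        for pred, level in zip(predicted, gold_levels)
    )
    return hits / len(predicted)


# An overconfident model that always answers directly matches the
# assumed target strategy only on the unambiguous level.
always_direct = [Strategy.DIRECT_ANSWER] * 4
score = strategy_accuracy(always_direct, [0, 1, 2, 3])
```

Under these assumptions, a model that always answers directly is scored as correct on only one of the four levels, which mirrors the overconfidence failure mode the abstract describes.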

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
