Feb 18, 2026 · arXiv:2602.16138

IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

AI Summary

The paper introduces IRIS, a training-free method that uses real-time eye-tracking data to resolve ambiguity in open-ended VQA with VLMs. IRIS builds on the finding that fixations closest to the moment a participant starts asking a question are the most informative for disambiguation, and it supplies that gaze data to the VLM at inference time. In a user study with 500 unique image-question pairs, IRIS more than doubles accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous ones across state-of-the-art VLMs.

Key Contribution

Eye-tracking enables a simple, training-free method that more than doubles VQA accuracy on ambiguous questions in large VLMs by using the fixations closest to the moment the question is asked.

Abstract

We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
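The abstract's central finding, that fixations closest to question onset carry the most disambiguating signal, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Fixation` record, the `fixations_near_onset` and `gaze_hint` helpers, the choice of `k`, and the text-hint format are all hypothetical, and the abstract does not specify how the gaze data is actually passed to the VLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """One gaze fixation in image pixel coordinates (hypothetical record layout)."""
    x: float
    y: float
    start_s: float     # fixation start time, in seconds
    duration_s: float  # fixation duration, in seconds


def fixations_near_onset(fixations, onset_s, k=3):
    """Return the k fixations whose start time is closest to the
    question onset, ordered chronologically. k is an assumed parameter."""
    nearest = sorted(fixations, key=lambda f: abs(f.start_s - onset_s))[:k]
    return sorted(nearest, key=lambda f: f.start_s)


def gaze_hint(fixations):
    """Format the selected fixations as a plain-text hint that could
    accompany the question (an assumed prompt format, for illustration)."""
    points = ", ".join(f"({f.x:.0f}, {f.y:.0f})" for f in fixations)
    return f"The user was looking at these image coordinates when asking: {points}"


if __name__ == "__main__":
    recorded = [
        Fixation(10, 20, 0.2, 0.30),
        Fixation(30, 40, 1.0, 0.20),
        Fixation(50, 60, 2.9, 0.25),
        Fixation(70, 80, 3.1, 0.20),
        Fixation(90, 100, 5.0, 0.30),
    ]
    # Suppose the participant starts speaking 3.0 s after the image appears.
    selected = fixations_near_onset(recorded, onset_s=3.0, k=2)
    print(gaze_hint(selected))
```

Selecting by temporal distance to speech onset keeps the step training-free: it only filters the recorded gaze stream and leaves the VLM untouched.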

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
