The paper introduces IRIS, a training-free method that improves open-ended VQA in VLMs by incorporating real-time eye-tracking data to resolve ambiguity. IRIS builds on the finding that the fixations closest to the moment a user starts asking a question are the most informative, and uses these fixations as "saccades" to guide the VLM's attention. Experiments on a new benchmark dataset show that IRIS more than doubles accuracy on ambiguous questions (from 35.2% to 77.2%) while preserving performance on unambiguous ones across several VLMs.
Eye-tracking unlocks a simple, training-free method to more than double VQA accuracy on ambiguous questions in large VLMs by focusing on fixations just before the question is asked.
We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that the fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in large VLMs: using them more than doubles response accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated for ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset for using eye-movement data to disambiguate VQA, a novel real-time interactive protocol, and an evaluation suite.
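The core selection step described above, keeping the fixations nearest to the moment the question begins, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Fixation` record, the `fixations_near_onset` and `describe_gaze` helpers, the window size `k`, the normalized coordinates, and the idea of serializing the selected fixations as text are all assumptions made for illustration. The abstract does not specify how gaze is passed to the VLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """Hypothetical fixation record; field names are illustrative only."""
    x: float        # horizontal position, normalized to [0, 1]
    y: float        # vertical position, normalized to [0, 1]
    start_s: float  # fixation start time, in seconds


def fixations_near_onset(fixations, onset_s, k=3):
    """Return the k fixations whose start time is closest to the
    question onset, ordered chronologically. k is an assumed parameter."""
    nearest = sorted(fixations, key=lambda f: abs(f.start_s - onset_s))[:k]
    return sorted(nearest, key=lambda f: f.start_s)


def describe_gaze(fixations):
    """Serialize fixations as text that could accompany the question
    (an assumed format, not the one used in the paper)."""
    return "; ".join(f"({f.x:.2f}, {f.y:.2f})" for f in fixations)


# Toy gaze trace: the user starts speaking at t = 1.2 s.
trace = [
    Fixation(0.10, 0.20, 0.0),
    Fixation(0.55, 0.40, 0.9),
    Fixation(0.60, 0.42, 1.3),
    Fixation(0.15, 0.80, 2.6),
]
selected = fixations_near_onset(trace, onset_s=1.2, k=2)
gaze_hint = describe_gaze(selected)
print(gaze_hint)
```

In this toy trace the two fixations at 0.9 s and 1.3 s are kept, while the earlier and later ones are discarded. A real pipeline would also need to detect speech onset and map gaze coordinates to image regions, neither of which is shown here.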