F-Initiatives, Northwestern, Université Sorbonne Paris Nord | Apr 9, 2026 | arXiv:2604.08494

What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno

AI Summary

This paper introduces a semantic scanpath similarity framework that uses vision-language models (VLMs) to encode fixations as textual descriptions, so that semantic similarity between scanpaths can be computed. The authors encode each fixation under controlled visual context using patch-based and marker-based strategies, then compute semantic similarity with embedding-based and lexical NLP metrics. Results on free-viewing eye-tracking data show that semantic similarity captures variance that is partially independent of geometric alignment, revealing content agreement even when scanpaths diverge spatially.

Key Contribution

VLMs reveal that viewers can look at different places yet still "see" the same content, adding a layer of semantic understanding to traditional eye-tracking analysis.

Abstract

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
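The contrast between semantic and geometric similarity described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fixation descriptions are hand-written stand-ins for VLM output, the bag-of-words cosine is only one possible lexical metric, the DTW is a textbook version with Euclidean local cost, and all names (`bow_cosine`, `dtw_distance`, `viewer_a`, `viewer_b`) are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (a simple lexical metric)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dtw_distance(path_a, path_b):
    """Textbook dynamic time warping over (x, y) fixations with Euclidean local cost."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical scanpaths: pixel coordinates plus one short description per fixation.
# In the framework, the descriptions would come from a VLM; here they are made up.
viewer_a = {
    "fixations": [(100, 120), (300, 200)],
    "descriptions": ["dog face", "red ball"],
}
viewer_b = {
    "fixations": [(400, 150), (320, 210)],
    "descriptions": ["dog tail", "red ball"],
}

# Aggregate fixation-level descriptions into one scanpath-level text (assumed: simple join).
text_a = " ".join(viewer_a["descriptions"])
text_b = " ".join(viewer_b["descriptions"])

semantic = bow_cosine(text_a, text_b)
geometric = dtw_distance(viewer_a["fixations"], viewer_b["fixations"])
print(f"lexical similarity: {semantic:.3f}, DTW distance: {geometric:.1f} px")
```

In this toy case the two viewers fixate different locations but largely describe the same objects, which is the kind of disagreement between the two measures the abstract refers to. An embedding-based metric would replace `bow_cosine` with cosine similarity over sentence embeddings.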

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
