NJU | Apr 22, 2026 | arXiv:2604.20361

Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

Rong Quan, Yantao Lai, Dong Liang, Jie Qin

AI Summary

This paper tackles Object Referring-guided Scanpath Prediction (ORSP) by introducing ScanVLA, a model that uses a Vision-Language Model (VLM) for multimodal feature fusion. To improve positional awareness, the authors add a History Enhanced Scanpath Decoder (HESD) that takes the positions of past fixations as input, and they integrate a frozen Segmentation LoRA to localize the referred object more precisely. Experiments show that ScanVLA significantly outperforms existing methods at predicting human attention scanpaths during object search.

Key Contribution

Outperforms existing object referring-guided scanpath prediction methods by combining VLM-based feature fusion with a fixation-history decoder (HESD) and a frozen Segmentation LoRA.

Abstract

Object Referring-guided Scanpath Prediction (ORSP) aims to predict the scanpath of human attention as a person searches a visual scene for a specific target object, guided by a linguistic description of that object. Multimodal information fusion is central to ORSP. We therefore propose a novel model, ScanVLA, which first uses a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and the referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we take two steps. We propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes the position information of historical fixations as input, which helps it predict a more reasonable position for the current fixation. We also adopt a frozen Segmentation LoRA as an auxiliary component to localize the referred object more precisely, which improves scanpath prediction without incurring large additional computational or time costs. Extensive experimental results demonstrate that ScanVLA significantly outperforms existing scanpath prediction methods under object referring.
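The history-conditioned decoding idea in the abstract can be illustrated with a toy autoregressive loop. This is a minimal sketch under stated assumptions, not the paper's HESD: the names `ORSPInput`, `toy_step`, and `predict_scanpath` are hypothetical, fixations are assumed to be normalized (x, y) pairs, and the "move halfway toward a target hint" rule is a stand-in for a learned decoder. The only point it shows is that each step receives the positions of earlier fixations as an explicit input.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed representation: a fixation is an (x, y) pair normalized to [0, 1].
Fixation = Tuple[float, float]


@dataclass
class ORSPInput:
    """Hypothetical container for one ORSP sample."""
    fused_features: Sequence[float]  # stand-in for VLM-fused image + text features
    target_hint: Fixation            # stand-in for a localization cue for the referred object


def toy_step(sample: ORSPInput, history: Sequence[Fixation]) -> Fixation:
    """Toy predictor: move halfway from the last fixation toward the target hint.

    Not the paper's decoder; it only demonstrates that the positions of
    earlier fixations are passed in explicitly at every step.
    """
    last = history[-1] if history else (0.5, 0.5)  # assumption: start at image center
    tx, ty = sample.target_hint
    return (last[0] + 0.5 * (tx - last[0]), last[1] + 0.5 * (ty - last[1]))


def predict_scanpath(
    sample: ORSPInput,
    step: Callable[[ORSPInput, Sequence[Fixation]], Fixation],
    max_len: int = 5,
) -> List[Fixation]:
    """Autoregressively build a scanpath, feeding the fixation history to each step."""
    history: List[Fixation] = []
    for _ in range(max_len):
        history.append(step(sample, history))
    return history


sample = ORSPInput(fused_features=[0.0], target_hint=(1.0, 0.0))
path = predict_scanpath(sample, toy_step, max_len=3)
```

In a real model, `toy_step` would be replaced by a learned decoder that combines the fused features with an encoding of the fixation history; the loop structure is the part that carries over.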

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
