This paper introduces a novel approach to gaze simulation in dynamic driving scenes that models gaze as an autoregressive dynamical system conditioned on gaze history and on environmental context, with the environment represented as gaze-centric graphs. The authors propose the Affinity Relation Transformer (ART) to process these graphs and the Object Density Network (ODN) to predict next-step gaze distributions. Experiments on the new Focus100 dataset indicate that the method, trained directly on raw gaze data, generates more realistic gaze trajectories, scanpath dynamics, and saliency maps than existing attention models.
Modeling raw gaze trajectories with a novel graph-based transformer beats saliency map and scanpath baselines at predicting human attention in driving, suggesting we've been throwing away valuable temporal information.
Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.
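To make the autoregressive formulation concrete, the sketch below unrolls a gaze trajectory one step at a time: each step builds a distribution over the next gaze point from the gaze history and the current object positions, samples from it, and appends the sample to the history. It illustrates the general idea only and is not the authors' ART or ODN. The function names, the distance-based mixture weights, the shared Gaussian spread, and the normalised [0, 1] image coordinates are all assumptions introduced for this example.

```python
import numpy as np

def next_gaze_distribution(history, objects, sigma=0.05, temperature=0.2):
    """Hypothetical stand-in for a learned next-step gaze predictor.

    Returns the parameters of an isotropic Gaussian mixture with one
    component per object. Objects closer to the most recent gaze point
    receive more weight (an assumption made for this sketch).
    """
    last = history[-1]
    dists = np.linalg.norm(objects - last, axis=1)
    logits = -dists / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights, objects, sigma

def rollout(initial_gaze, objects_per_step, rng):
    """Unroll a gaze trajectory autoregressively, one sample per time step."""
    history = [np.asarray(initial_gaze, dtype=float)]
    for objects in objects_per_step:
        weights, means, sigma = next_gaze_distribution(np.stack(history), objects)
        k = rng.choice(len(weights), p=weights)   # pick a mixture component
        sample = rng.normal(means[k], sigma)      # sample around that object
        history.append(np.clip(sample, 0.0, 1.0))  # stay inside the frame
    return np.stack(history)

rng = np.random.default_rng(0)
# Ten time steps, each with four hypothetical object centres in [0, 1]^2.
objects_per_step = [rng.uniform(0.0, 1.0, size=(4, 2)) for _ in range(10)]
trajectory = rollout([0.5, 0.5], objects_per_step, rng)
```

Here `trajectory` holds the initial gaze point followed by one sampled point per time step. Sampling from a distribution, rather than taking a single point estimate, is what lets repeated rollouts over the same scene produce different plausible trajectories.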