May 6, 2026

arXiv:2605.04770

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Berk Sezer, Ali Görkem Küçük, Erol Şahin, S. Kalkan

AI Summary

The paper introduces Gaze4HRI, a large-scale dataset for benchmarking zero-shot 3D gaze estimation models in Human-Robot Interaction (HRI). It addresses a limitation of existing benchmarks, which lack HRI-relevant conditions such as dynamic camera viewpoints and moving gaze targets. The evaluation shows that every evaluated state-of-the-art method fails in at least one HRI condition, and that steeply downward gaze is a failure point shared by all of them. The study also finds that models trained on diverse datasets, such as ETH-X-Gaze, are more robust in zero-shot use than models that rely on complex architectures, which suggests that data diversity and resilience-enhancing training techniques matter most for HRI.

Key Contribution

Turns out, every evaluated gaze estimation model stumbles when people look steeply downward, and complex architectures aren't the answer: data diversity is the main driver of zero-shot robustness in human-robot interaction.

Abstract

While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, and the motion of the camera and the gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset is the only method that maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that offers practical guidelines for practitioners and helps reshape future research. The dataset and code are available at https://gazeforhri.github.io.

Computer Vision, Eval Frameworks & Benchmarks, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
