UIUC | Mar 16, 2026 | arXiv:2603.15717

GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

Neeraj Solanki, Hong Ding, Sepehr Tabrizchi, Ali Shafiee Sarvestani, Shaahin Angizi, David Z. Pan, A. Roohi

AI Summary

The paper introduces GLANCE, a two-stage object detection pipeline for AR/VR that mimics foveal vision by combining a differentiable weightless neural network for gaze estimation with attention-guided ROI object detection. Gaze tracking is performed via memory lookups rather than MAC operations, achieving 8.32° angular error with only 393 MACs and 2.2 KiB of memory per frame. Deployed on an Arduino Nano 33 BLE, GLANCE achieves 48.1% mAP on COCO with sub-10 ms latency. By restricting object detection to attended regions, it reduces computation by 40-50% and energy by 65%, and it outperforms the global YOLOv12n baseline on small, medium, and large objects.

Key Contribution

Achieves real-time object detection on resource-constrained AR/VR devices by replacing compute-heavy multiply-accumulate operations with memory lookups, an approach inspired by human foveal vision.
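
To make "memory lookups instead of MAC operations" concrete, here is a minimal sketch of a classic WiSARD-style weightless classifier. It is an illustration under stated assumptions, not the paper's model: GLANCE uses a differentiable weightless network whose details are not given on this page, and the names `RamDiscriminator` and `classify`, the binarized input, and the coarse "left"/"right" gaze classes are all hypothetical.

```python
import random


class RamDiscriminator:
    """One class's bank of RAM nodes (hypothetical helper, not from the paper).

    Each node is addressed by a fixed tuple of input bits. A Python set
    stands in for the RAM: it stores the addresses seen during training.
    """

    def __init__(self, n_inputs, tuple_size, seed=0):
        rng = random.Random(seed)
        order = list(range(n_inputs))
        rng.shuffle(order)
        # Partition the shuffled input indices into address tuples.
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, n_inputs, tuple_size)]
        self.rams = [set() for _ in self.tuples]

    def _addresses(self, bits):
        # Pack each tuple of input bits into one integer address.
        for idx in self.tuples:
            addr = 0
            for i in idx:
                addr = (addr << 1) | bits[i]
            yield addr

    def train(self, bits):
        # Training only writes addresses; no weights are updated.
        for ram, addr in zip(self.rams, self._addresses(bits)):
            ram.add(addr)

    def score(self, bits):
        # Inference is one lookup per RAM node plus a count of hits.
        return sum(addr in ram
                   for ram, addr in zip(self.rams, self._addresses(bits)))


def classify(discriminators, bits):
    """Return the label whose discriminator has the most RAM hits."""
    return max(discriminators, key=lambda label: discriminators[label].score(bits))


# Toy usage: two coarse gaze classes on an 8-bit binarized input.
looking_left = [1, 1, 1, 1, 0, 0, 0, 0]
looking_right = [0, 0, 0, 0, 1, 1, 1, 1]
banks = {"left": RamDiscriminator(8, 4, seed=1),
         "right": RamDiscriminator(8, 4, seed=2)}
banks["left"].train(looking_left)
banks["right"].train(looking_right)
predicted = classify(banks, looking_left)
```

Inference here involves only bit packing, table lookups, and a count, which is the property the summary attributes to the gaze stage. A real gaze estimator would regress an angle rather than pick between two classes.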

Abstract

Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10 ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest (ROI) object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate (MAC) computations, achieving an angular error of 8.32° with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10 ms latency, meeting stringent AR/VR requirements by improving the communication time by 177×. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
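
The second stage, running detection only on the attended region, can be sketched as a gaze-centred crop. This is a minimal illustration under assumptions: the abstract does not state the ROI size or how the gaze point is mapped to the frame, so the function names `gaze_roi` and `crop`, the normalized gaze coordinates, and the 320×240 window in a 640×480 frame are all hypothetical.

```python
def gaze_roi(gaze_x, gaze_y, frame_w, frame_h, roi_w, roi_h):
    """Return (left, top, right, bottom) of an ROI centred on the gaze point.

    gaze_x and gaze_y are assumed to be normalized to [0, 1]. The window is
    shifted, not shrunk, when it would cross the frame border.
    """
    cx = gaze_x * frame_w
    cy = gaze_y * frame_h
    left = int(round(cx - roi_w / 2))
    top = int(round(cy - roi_h / 2))
    # Clamp so the full ROI stays inside the frame.
    left = max(0, min(left, frame_w - roi_w))
    top = max(0, min(top, frame_h - roi_h))
    return left, top, left + roi_w, top + roi_h


def crop(frame, box):
    """Crop a frame stored as a list of pixel rows to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Toy usage: a 640x480 frame of zeros and a centred gaze point.
frame = [[0] * 640 for _ in range(480)]
box = gaze_roi(0.5, 0.5, 640, 480, 320, 240)
patch = crop(frame, box)
# Fraction of pixels the detector would see with this hypothetical window.
pixel_fraction = (len(patch) * len(patch[0])) / (640 * 480)
```

Only the cropped patch would be passed to the detector. The pixel fraction of this illustrative window is not meant to reproduce the 40-50% compute reduction reported in the abstract, which depends on the authors' actual ROI policy and detector.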

Architecture Design (Transformers, SSMs, MoE) | Computer Vision Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
