The authors introduce WearVQA, a new benchmark designed to evaluate the visual question answering (VQA) capabilities of multimodal AI assistants on wearable devices, addressing the challenges of egocentric, real-world scenarios. WearVQA consists of 2,520 image-question-answer triplets spanning diverse image domains, cognitive task types, and wearable-specific image quality issues. Experiments with open-source and proprietary multimodal LLMs show QA accuracies ranging from only 24% to 52%, highlighting the benchmark's difficulty and its potential to drive advancements in robust wearable AI systems.
Current multimodal LLMs struggle with the messy, real-world visual data captured by wearable devices, achieving only 24-52% accuracy on the new WearVQA benchmark.
We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multimodal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains covering both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearable-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieve QA accuracies as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.
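To make the benchmark's structure concrete, here is a minimal sketch of how a WearVQA-style triplet and its accuracy metric might be represented. The field names and categories are illustrative assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass

@dataclass
class WearVQAExample:
    """Hypothetical schema for one image-question-answer triplet.

    Field names are assumptions for illustration; the real benchmark
    may use a different format.
    """
    image_path: str     # egocentric image captured by the wearable
    question: str       # question grounded in a realistic use case
    answer: str         # reference answer, checked by an LLM judge
    domain: str         # one of 7 image domains (text-centric or general)
    task_type: str      # one of 10 cognitive task types
    quality_issue: str  # one of 6 wearable-specific degradations, or "none"

def qa_accuracy(judgements: list[bool]) -> float:
    """QA accuracy: fraction of examples the judge marked correct."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```

Per-category accuracies (by domain, task type, or quality issue) would then be computed by filtering examples on the corresponding field before applying `qa_accuracy`, which is how drops on lower-quality images and reasoning-heavy tasks could be measured.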