The authors introduce WearVQA, a new benchmark designed to evaluate the visual question answering (VQA) capabilities of multimodal AI assistants on wearable devices, addressing the challenges of egocentric, real-world scenarios. WearVQA consists of 2,520 image-question-answer triplets spanning diverse image domains, cognitive task types, and wearable-specific image quality issues. Experiments with open-source and proprietary multimodal LLMs show QA accuracies ranging from only 24% to 52%, highlighting the benchmark's difficulty and its potential to drive advancements in robust wearable AI systems.
Current multimodal LLMs struggle with the messy, real-world visual data captured by wearable devices, achieving only 24-52% accuracy on the new WearVQA benchmark.
We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multimodal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains covering both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearable-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multimodal LLMs achieve QA accuracies as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement towards robust, real-world multimodal wearable AI systems.
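To make the benchmark's structure concrete, here is a minimal sketch of how a WearVQA-style triplet and its accuracy metric might be represented. The field names and categories are illustrative assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass

@dataclass
class WearVQAExample:
    """Hypothetical schema for one image-question-answer triplet.

    Field names are assumptions for illustration; the real benchmark
    may use a different format.
    """
    image_path: str     # egocentric image captured by the wearable
    question: str       # question grounded in a realistic use case
    answer: str         # reference answer, checked by an LLM judge
    domain: str         # one of 7 image domains (text-centric or general)
    task_type: str      # one of 10 cognitive task types
    quality_issue: str  # one of 6 wearable-specific degradations, or "none"

def qa_accuracy(judgements: list[bool]) -> float:
    """QA accuracy: fraction of examples the judge marked correct."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```

Per-category accuracies (by domain, task type, or quality issue) would then be computed by filtering examples on the corresponding field before applying `qa_accuracy`, which is how drops on lower-quality images and reasoning-heavy tasks could be measured.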