Apr 6, 2026

arXiv:2604.04905

ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

Dawar Khan, Alexandre Kouyoumdjian, Xinyu Liu, Omar Mena, Dominik Engel, Ivan Viola

AI Summary

ClickAIXR is a framework for on-device multimodal vision-language interaction in XR that lets users precisely select real-world objects with a controller. The selected object's image is processed locally by an on-device VLM to answer natural language questions, which improves selection precision and addresses privacy concerns relative to cloud-based or gaze-based methods. A user study comparing ClickAIXR with cloud-based models (Gemini 2.5 Flash and ChatGPT 5) reports acceptable user experience and moderate latency, highlighting the potential of on-device AI for trustworthy XR interactions.

Key Contribution

On-device VLMs can power privacy-preserving, precise interactions with real-world objects in XR; in a user study against cloud-based models, user experience was acceptable and latency moderate.

Abstract

We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html
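The abstract says that the clicked object's image is passed to the local VLM, but it does not say how that image is extracted from the camera frame. The following is a minimal sketch of one plausible step, assuming the controller click has already been projected to a pixel coordinate in the frame. The function names, the fixed square window, and the 224-pixel default are all hypothetical and are not taken from the ClickAIXR source code.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop_box_around_click(click_x: int, click_y: int,
                          frame_w: int, frame_h: int,
                          box_size: int = 224) -> Box:
    """Return a box_size window centred on the click, shifted to stay in the frame.

    Hypothetical helper: the window size and the clamping policy are
    illustrative assumptions, not the paper's method.
    """
    if not (0 <= click_x < frame_w and 0 <= click_y < frame_h):
        raise ValueError("click lies outside the frame")
    # The window can never be larger than the frame itself.
    w = min(box_size, frame_w)
    h = min(box_size, frame_h)
    # Centre on the click, then clamp so the window stays inside the frame.
    left = min(max(click_x - w // 2, 0), frame_w - w)
    top = min(max(click_y - h // 2, 0), frame_h - h)
    return left, top, left + w, top + h


def crop_frame(frame: Sequence[Sequence], box: Box) -> List[list]:
    """Cut the box out of a frame stored as rows of pixels."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in frame[top:bottom]]


if __name__ == "__main__":
    # A click in the middle of a 640x480 frame.
    print(crop_box_around_click(320, 240, 640, 480))
```

In a pipeline of this shape, the cropped patch rather than the full frame would be handed to the local model, which is one way to keep the question tied to the selected object.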

Computer Vision, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
