Mar 16, 2026 · arXiv:2603.15403

Pointing-Based Object Recognition

AI Summary

This paper introduces a pipeline for recognizing objects pointed to by humans in RGB images, crucial for intuitive human-robot interaction. The system combines object detection, pose estimation, monocular depth estimation, and vision-language models to infer the target of a pointing gesture. Experiments on a custom dataset demonstrate that incorporating depth information derived from monocular images significantly improves target identification accuracy, particularly in cluttered scenes.

Key Contribution

Monocular depth estimation can significantly boost the accuracy of pointing-based object recognition in human-robot interaction, even without specialized depth sensors.

Abstract

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
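To make the role of depth concrete, the following is a minimal sketch of one way a pointing target could be chosen once 2D keypoints and object detections have been lifted to 3D using a monocular depth map. It is an illustration under stated assumptions, not the paper's implementation: the elbow-to-wrist pointing ray, the pinhole back-projection, the 30-degree angular threshold, and all function names (`back_project`, `angle_to_ray`, `select_target`) are hypothetical choices made for this example.

```python
import math
from typing import Dict, Optional, Tuple

Point3D = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with an estimated depth to camera coordinates
    using a pinhole model (assumed here; intrinsics are hypothetical)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def angle_to_ray(origin: Point3D, direction: Point3D, point: Point3D) -> float:
    """Angle in radians between the ray direction and the vector origin -> point."""
    to_point = tuple(p - o for p, o in zip(point, origin))
    norm_d = math.sqrt(sum(c * c for c in direction))
    norm_p = math.sqrt(sum(c * c for c in to_point))
    if norm_d == 0.0 or norm_p == 0.0:
        return math.pi  # degenerate input: treat as maximally off-ray
    cos = sum(a * b for a, b in zip(direction, to_point)) / (norm_d * norm_p)
    return math.acos(max(-1.0, min(1.0, cos)))


def select_target(elbow: Point3D, wrist: Point3D,
                  objects: Dict[str, Point3D],
                  max_angle_deg: float = 30.0) -> Optional[str]:
    """Return the label of the object centroid closest in angle to the
    elbow -> wrist ray, or None if nothing lies within max_angle_deg."""
    direction = tuple(w - e for w, e in zip(wrist, elbow))
    best_label, best_angle = None, math.radians(max_angle_deg)
    for label, centroid in objects.items():
        angle = angle_to_ray(wrist, direction, centroid)
        if angle < best_angle:
            best_label, best_angle = label, angle
    return best_label


# Two detections share the same pixel (570, 240) but differ in estimated depth.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
objects = {
    "cup": back_project(570, 240, 2.0, **intrinsics),
    "bottle": back_project(570, 240, 4.0, **intrinsics),
}
elbow, wrist = (0.0, 0.0, 2.0), (0.2, 0.0, 2.0)
target = select_target(elbow, wrist, objects)
```

In this toy scene the two objects coincide in the image plane, so a purely 2D ray from the arm cannot separate them, while the 3D ray favors the object at the arm's depth. This mirrors the overlapping-object case the abstract highlights, although the paper's actual selection rule may differ.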

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
