This paper introduces a pipeline for recognizing the objects that people point at in RGB images, a capability needed for intuitive human-robot interaction. The system combines object detection, pose estimation, monocular depth estimation, and vision-language models to infer the target of a pointing gesture. Experiments on a custom dataset show that depth information estimated from a single monocular image significantly improves target identification accuracy, particularly in cluttered scenes.
Monocular depth estimation can significantly boost the accuracy of pointing-based object recognition in human-robot interaction, even without specialized depth sensors.
This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
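The role of depth in separating overlapping candidates can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a pinhole camera with known intrinsics, a metric depth map from some monocular estimator, a pointing ray drawn from the elbow keypoint through the wrist keypoint, and target selection by the smallest angle between that ray and each detected box centre. The function names (`backproject`, `angle_to_ray`, `select_target`) and all of these design choices are hypothetical.

```python
import numpy as np


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth `depth` to a 3D camera-frame point (pinhole model)."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])


def angle_to_ray(origin, direction, point):
    """Angle in radians between the ray direction and the vector from origin to point."""
    to_point = point - origin
    cos = np.dot(direction, to_point) / (
        np.linalg.norm(direction) * np.linalg.norm(to_point)
    )
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def select_target(elbow_px, wrist_px, boxes, depth_map, intrinsics):
    """Return the index of the box whose 3D centre lies closest to the pointing ray.

    elbow_px, wrist_px: (u, v) pixel keypoints from a pose estimator (assumed input).
    boxes: list of (x0, y0, x1, y1) detections from an object detector (assumed input).
    depth_map: HxW array of per-pixel depth (assumed metric).
    intrinsics: (fx, fy, cx, cy) of the assumed pinhole camera.
    """
    fx, fy, cx, cy = intrinsics

    def lift(px):
        u, v = px
        return backproject(u, v, depth_map[int(v), int(u)], fx, fy, cx, cy)

    elbow, wrist = lift(elbow_px), lift(wrist_px)
    direction = wrist - elbow  # pointing direction: elbow -> wrist (an assumption)

    best_idx, best_angle = None, np.inf
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        centre = lift(((x0 + x1) / 2, (y0 + y1) / 2))
        angle = angle_to_ray(wrist, direction, centre)
        if angle < best_angle:
            best_idx, best_angle = i, angle
    return best_idx
```

In a purely 2D version of this test, two boxes that lie along the same image line as the forearm are indistinguishable. Once each centre is lifted to 3D, a box sitting well behind the pointing plane produces a large angle to the ray and is rejected, which is consistent with the reported benefit of depth in cluttered scenes.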