Apr 27, 2026, arXiv:2604.24002

IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

Hamed Rahimi, Clémence Grislain, Adrien Jacquet Cretides, Olivier Sigaud, Mohamed Chetouani

AI Summary

IntentVLM, a novel video-language framework, is introduced to improve human intention recognition in social robots by mimicking forward-inverse modeling from cognitive science. The framework decomposes intention understanding into goal candidate generation and structured inference, mitigating hallucinations. Evaluated on IntentQA and Inst-IT Bench, IntentVLM achieves state-of-the-art results, surpassing baselines by 30% and matching human performance, demonstrating enhanced open-vocabulary intention understanding.

Key Contribution

A two-stage video-language model that first generates goal candidates and then selects among them lets robots recognize human intentions with accuracy that matches human performance on the evaluated benchmarks.

Abstract

Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals, such as text and visual cues, to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. Inspired by forward-inverse modeling in cognitive science, the approach decomposes intention understanding into goal candidate generation followed by structured inference through selection, which reduces hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, surpassing baseline performance by 30% and matching human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.
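The two-stage decomposition described in the abstract (generate goal candidates, then select among them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `infer_intention`, the callables `propose_goals` and `score_goal`, and the toy word-overlap scorer are all hypothetical stand-ins for what would be video-language model calls in the real system. The sketch only shows why restricting the final answer to an explicit candidate set keeps the output within generated options.

```python
from typing import Callable, List


def infer_intention(
    video_description: str,
    propose_goals: Callable[[str, int], List[str]],
    score_goal: Callable[[str, str], float],
    num_candidates: int = 5,
) -> str:
    """Hypothetical two-stage sketch: propose goal candidates, then select one."""
    # Stage 1 (assumed): generate open-vocabulary goal candidates.
    candidates = propose_goals(video_description, num_candidates)
    if not candidates:
        raise ValueError("no goal candidates were generated")
    # Stage 2 (assumed): pick the candidate that best explains the observation.
    # The answer is always one of the candidates, never free-form text.
    return max(candidates, key=lambda goal: score_goal(video_description, goal))


# Toy stand-ins so the sketch runs without any model; both are assumptions.
def toy_propose(description: str, k: int) -> List[str]:
    return ["drink water", "wash cup", "water the plant"][:k]


def toy_score(description: str, goal: str) -> float:
    # Count how many words of the goal appear in the description.
    observed = set(description.lower().split())
    return float(sum(word in observed for word in goal.lower().split()))


if __name__ == "__main__":
    desc = "a person fills a cup and pours it on the plant"
    print(infer_intention(desc, toy_propose, toy_score, num_candidates=3))
```

In a real pipeline both callables would presumably wrap a video-language model operating on video frames rather than a text description; the selection step could equally be posed as a multiple-choice question over the candidates.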

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 62

Year: 2026

Venue: N/A
