Mar 2, 2026

arXiv:2603.01804

Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments

Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu

AI Summary

This paper introduces a real-time framework for non-verbal human-AI interaction based on 2D body keypoints and uses it to investigate the statistical fidelity of AI-generated motion relative to human motion. The authors trained four lightweight architectures on human video clips and synthetically generated sequences, reaching up to 100 FPS on an NVIDIA Orin Nano. Their analysis shows that pretraining on synthetic data reduces motion errors, but a performance gap remains: when the best model is evaluated on keypoints extracted from SORA and VEO videos, performance drops on SORA clips and degrades far less on VEO, which suggests that temporal coherence matters more than image fidelity for realistic interaction.

Key Contribution

Despite advances in text-to-video generation, AI-generated body language still lags behind human motion, with temporal coherence proving more crucial than visual fidelity for realistic interaction.

Abstract

We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask whether contemporary generative models move beyond surface mimicry to participate in the silent but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
