Max Planck | Mar 3, 2026 | arXiv:2603.03282

MIBURI: Towards Expressive Interactive Gesture Synthesis

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

AI Summary

The paper introduces MIBURI, an online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. MIBURI uses body-part-aware gesture codecs to encode hierarchical motion details into multi-level discrete tokens, which are then generated autoregressively, conditioned on LLM-based speech-text embeddings. Evaluations show that MIBURI produces natural and contextually aligned gestures in real time, addressing the limitations of existing approaches for Embodied Conversational Agents (ECAs), which either produce rigid, low-diversity motions or depend on future speech context and long run times.

Key Contribution

Realistic and expressive full-body gestures for conversational agents can be generated in real time by a causal framework that models both temporal dynamics and part-level motion hierarchy.

Abstract

Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part-aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations against recent baselines demonstrate that our causal, real-time approach produces natural and contextually aligned gestures. We urge the reader to explore the demo videos at https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
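
The two-dimensional causal generation described in the abstract can be pictured as a nested loop: an outer loop over time and an inner loop over body parts and token levels. The sketch below is only an illustration under stated assumptions. The part names, the number of levels, the codebook size, the within-frame ordering, and the `toy_predict` stand-in are all hypothetical and are not taken from the paper, so this should not be read as the authors' architecture or codecs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical layout: part names, level count, and codebook size are
# illustrative assumptions, not values from the paper.
PARTS = ("face", "upper_body", "hands", "lower_body")
NUM_LEVELS = 2
CODEBOOK_SIZE = 16

Slot = Tuple[str, int]      # (body part, token level)
Frame = Dict[Slot, int]     # one discrete token per slot at a single time step


def slot_order() -> List[Slot]:
    """Assumed within-frame order: each part in turn, coarse level before fine."""
    return [(part, level) for part in PARTS for level in range(NUM_LEVELS)]


def generate(
    speech_features: Sequence[float],
    predict: Callable[[List[Frame], Frame, Slot, Sequence[float]], int],
) -> List[Frame]:
    """Emit one frame of tokens per speech step without looking ahead.

    Outer loop: time. Inner loop: parts and levels. Each prediction sees
    only past frames, tokens already chosen in the current frame, and the
    speech features up to and including the current step.
    """
    frames: List[Frame] = []
    for t in range(len(speech_features)):
        current: Frame = {}
        for slot in slot_order():
            current[slot] = predict(frames, current, slot, speech_features[: t + 1])
        frames.append(current)
    return frames


def toy_predict(
    past: List[Frame], current: Frame, slot: Slot, speech_so_far: Sequence[float]
) -> int:
    """Deterministic stand-in for a learned model (purely illustrative)."""
    part, level = slot
    seed = (
        len(past) * 31
        + PARTS.index(part) * 7
        + level
        + len(current)
        + int(speech_so_far[-1] * 10)
    )
    return seed % CODEBOOK_SIZE


frames = generate([0.1, 0.5, 0.9], toy_predict)
```

Because each prediction depends only on information available at or before the current step, generating from a prefix of the speech input yields the same frames as the corresponding prefix of the full output. That property is what allows this kind of loop to run online as speech arrives.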

Natural Language Processing | Robotics & Embodied AI | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
