Imperial | INRIA | projects/hoflow.html | Tencent AI | Apr 12, 2026 | arXiv:2604.10836

HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

Zerui Chen, Rolandos Alexandros Potamias, Shi-Zhe Chen, Jiankang Deng, Cordelia Schmid, Stefanos Zafeiriou

AI Summary

HO-Flow is a framework for generating realistic hand-object interaction (HOI) motion sequences from text and canonical 3D objects. An interaction-aware VAE encodes hand and object motions into a unified latent space, using hand and object kinematics to capture interaction dynamics. A masked flow matching model with autoregressive temporal reasoning then generates continuous latents. The authors report state-of-the-art physical plausibility and motion diversity on the GRAB, OakInk, and DexYCB benchmarks.

Key Contribution

HO-Flow combines an interaction-aware VAE with masked flow matching to synthesize hand-object interaction motion, with reported state-of-the-art physical plausibility and motion diversity on GRAB, OakInk, and DexYCB.

Abstract

Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and to perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from text and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
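
The abstract states that object motions are predicted relative to the initial frame, but this page does not give the exact parameterization. As a minimal sketch of the general idea only, the snippet below assumes the object is rigid and that each frame's pose is given as a rotation matrix and a translation mapping the canonical object into world coordinates. The function names (`relative_to_first_frame`, `rot_z`) and the pose convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def relative_to_first_frame(rotations, translations):
    """Express per-frame rigid poses relative to the pose in frame 0.

    Assumed convention (hypothetical, not from the paper):
      rotations:    (T, 3, 3) matrices mapping canonical -> world
      translations: (T, 3)    world-frame translations
    Computes T_0^{-1} T_t for every frame t, so frame 0 becomes the identity.
    """
    rotations = np.asarray(rotations, dtype=float)
    translations = np.asarray(translations, dtype=float)
    r0 = rotations[0]
    # The inverse of a rotation matrix is its transpose.
    rel_rot = np.einsum("ij,tjk->tik", r0.T, rotations)
    # R0^T (t_t - t_0), written for row vectors as (t_t - t_0) @ R0.
    rel_trans = (translations - translations[0]) @ r0
    return rel_rot, rel_trans


# Two-frame toy trajectory: the object turns 0.5 rad about z and shifts along x.
rotations = np.stack([rot_z(0.3), rot_z(0.8)])
translations = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 3.0]])
rel_rot, rel_trans = relative_to_first_frame(rotations, translations)
```

Under this convention the relative sequence no longer depends on where the object starts in the world, which is one plausible reason such a representation would transfer across datasets with different global placements.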

Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
