Mar 10, 2026 · arXiv:2603.09883

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang

AI Summary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction (HOI) videos from sparse motion guidance: wrist joint coordinates and object bounding boxes. To improve fidelity under such sparse guidance, the authors propose an Object-Stressed Attention mechanism and a Multi-Task Auxiliary Training strategy that draws on a curated HOI dataset. Experiments show that DISPLAY generates high-fidelity, controllable HOI videos across diverse tasks, addressing the limitations of prior methods that rely on dense control signals or template videos.

Key Contribution

Generates realistic and controllable videos of humans interacting with objects using only sparse motion cues: wrist positions and object bounding boxes.

Abstract

Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce DISPLAY, a framework guided by Sparse Motion Guidance, which is composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
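To make the idea of sparse guidance concrete, the following is a minimal sketch of how a per-frame signal of two wrist coordinates and one object bounding box might be represented, and how a user could specify a motion from two keyframes. The abstract does not describe how DISPLAY encodes or consumes its guidance, so every name here (`FrameGuidance`, `interpolate_guidance`), the pixel-coordinate convention, and the linear interpolation are assumptions made for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]                # (x, y) in pixels (assumed convention)
Box = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class FrameGuidance:
    """Hypothetical per-frame sparse guidance: two wrists and one object box."""
    left_wrist: Point
    right_wrist: Point
    object_box: Box


def interpolate_guidance(
    start: FrameGuidance, end: FrameGuidance, num_frames: int
) -> List[FrameGuidance]:
    """Linearly interpolate guidance between two keyframes.

    Linear interpolation is an illustrative choice, not something the
    paper's abstract specifies.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")

    def lerp(a, b, t):
        # Interpolate each coordinate independently.
        return tuple(x + (y - x) * t for x, y in zip(a, b))

    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append(
            FrameGuidance(
                left_wrist=lerp(start.left_wrist, end.left_wrist, t),
                right_wrist=lerp(start.right_wrist, end.right_wrist, t),
                object_box=lerp(start.object_box, end.object_box, t),
            )
        )
    return frames
```

Under these assumptions, a whole clip is described by six numbers per frame, which illustrates why such guidance is far lighter than dense pose or mask sequences and can be authored by hand.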

Computer Vision · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
