The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction (HOI) videos from sparse motion guidance consisting of wrist joint coordinates and object bounding boxes. To improve fidelity under sparse guidance, the authors propose an Object-Stressed Attention mechanism and a Multi-Task Auxiliary Training strategy that leverages a curated HOI dataset. Experiments demonstrate DISPLAY's ability to generate high-fidelity, controllable HOI videos across diverse tasks, addressing limitations of prior methods that rely on dense control signals or template videos.
Generate realistic and controllable videos of humans interacting with objects using only sparse motion cues, like wrist positions and object bounding boxes.
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
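To make the idea of Sparse Motion Guidance more concrete, the following is a minimal Python sketch of how per-frame wrist coordinates and an object bounding box might be stored and rasterized into a two-channel control map. This is an illustration only: the names `FrameGuidance` and `rasterize`, the pixel-coordinate convention, the square wrist markers, and the two-channel layout are all assumptions, not the encoding used by DISPLAY.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y) in pixels (assumed convention)
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class FrameGuidance:
    """Hypothetical per-frame sparse guidance: two wrists plus one object box."""
    left_wrist: Point
    right_wrist: Point
    object_box: Box


def rasterize(frame: FrameGuidance, height: int, width: int,
              radius: int = 2) -> List[List[List[int]]]:
    """Render one frame of guidance as a 2-channel binary map.

    Channel 0 marks a small square around each wrist.
    Channel 1 marks the filled bounding box, which carries no object shape.
    """
    maps = [[[0] * width for _ in range(height)] for _ in range(2)]

    # Wrist channel: a (2 * radius + 1)-pixel square per wrist, clipped to the image.
    for x, y in (frame.left_wrist, frame.right_wrist):
        cx, cy = int(round(x)), int(round(y))
        for row in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for col in range(max(0, cx - radius), min(width, cx + radius + 1)):
                maps[0][row][col] = 1

    # Object channel: fill the box (inclusive bounds), clipped to the image.
    x0, y0, x1, y1 = (int(round(v)) for v in frame.object_box)
    for row in range(max(0, y0), min(height, y1 + 1)):
        for col in range(max(0, x0), min(width, x1 + 1)):
            maps[1][row][col] = 1

    return maps
```

A video clip would then be a list of `FrameGuidance` records, one per frame. Because the object channel holds only a filled rectangle, the same guidance could in principle be reused for objects of different shapes, which is the intuition behind calling the box shape-agnostic.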