Apr 6, 2026 | arXiv:2604.04843

InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen, Guanjie Zheng

AI Summary

This paper introduces InfBaGel, a coarse-to-fine framework for generating human-object-scene interactions (HOSI) conditioned on instructions, addressing the challenges of dynamic scene changes and limited annotated data. The method employs a dynamic perception strategy that uses trajectories from previous refinement steps to update scene context within a consistency model's iterative denoising process. By also incorporating bump-aware guidance and a hybrid training strategy using pseudo-HOSI samples, InfBaGel achieves state-of-the-art performance in HOSI and HOI generation, with strong generalization to unseen scenes.

Key Contribution

InfBaGel generates realistic human interactions with objects and scenes despite limited annotated HOSI data, using a hybrid training strategy that combines pseudo-HOSI samples synthesized from HOI datasets with high-fidelity HSI data.

Abstract

Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, with strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/
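To make the idea of voxelized scene occupancy concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`to_voxel`, `voxelize`, `occupied_frames`), the voxel size, and the sparse set-of-indices representation are all assumptions chosen for illustration. It shows how scene points could be reduced to a coarse occupancy grid, and how a trajectory could be checked against that grid without any fine-grained scene geometry.

```python
import math
from typing import Iterable, List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def to_voxel(point: Point, voxel_size: float) -> Voxel:
    """Map one 3D point to the integer index of the voxel that contains it."""
    x, y, z = point
    return (
        math.floor(x / voxel_size),
        math.floor(y / voxel_size),
        math.floor(z / voxel_size),
    )


def voxelize(points: Iterable[Point], voxel_size: float) -> Set[Voxel]:
    """Build a sparse occupancy grid: the set of voxel indices hit by any point."""
    return {to_voxel(p, voxel_size) for p in points}


def occupied_frames(
    trajectory: Sequence[Point], occupancy: Set[Voxel], voxel_size: float
) -> List[int]:
    """Return the indices of trajectory frames that fall inside an occupied voxel."""
    return [
        i for i, p in enumerate(trajectory)
        if to_voxel(p, voxel_size) in occupancy
    ]


# Hypothetical toy scene and object trajectory (illustrative values only).
scene_points = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.0), (1.2, 0.0, 0.0)]
occupancy = voxelize(scene_points, voxel_size=0.5)

trajectory = [(0.2, 0.3, 0.4), (0.2, 0.7, 0.1), (1.4, 0.1, 0.2)]
hits = occupied_frames(trajectory, occupancy, voxel_size=0.5)
```

In a pipeline like the one the abstract describes, a coarse grid of this kind could serve as the scene condition attached to an HOI sample, and frames flagged by a check like `occupied_frames` are the sort of signal a collision-avoiding guidance term might penalize. How InfBaGel actually encodes occupancy or computes its guidance is not specified in this listing.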

Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
