This paper introduces ImagineAgent, a framework that combines cognitive reasoning with generative imagination to improve Open-Vocabulary Human-Object Interaction (OV-HOI) understanding. ImagineAgent constructs cognitive maps to model relationships between detected entities and candidate actions, and invokes tools (retrieval augmentation, image cropping, and diffusion models) to gather domain knowledge and visual evidence. Experiments on SWIG-HOI and HICO-DET show state-of-the-art performance while using roughly 20% of the training data required by prior methods.
ImagineAgent's combination of cognitive maps and generative tools lets it surpass the previous state of the art on OV-HOI tasks while needing only about 20% of the training data.
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning in Open-Vocabulary Human-Object Interaction (OV-HOI) is limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. It then dynamically invokes tools, including retrieval augmentation, image cropping, and diffusion models, to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool-use efficiency. Evaluations on the SWIG-HOI and HICO-DET datasets demonstrate state-of-the-art performance while requiring approximately 20% of the training data used by existing methods, validating the robustness and efficiency of our approach.
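To make the pipeline concrete, here is a minimal Python sketch of how a cognitive map over detected entities and candidate actions might drive tool selection. Everything here (CognitiveMap, select_tool, the threshold values) is a hypothetical illustration inferred from the abstract, not the authors' actual API.

```python
# Hypothetical sketch of ImagineAgent's reasoning loop; names and thresholds
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class CognitiveMap:
    """Plausible (action, object) relations inferred for one image."""
    entities: list[str]                  # detected entities, e.g. ["person", "frisbee"]
    actions: list[str]                   # open-vocabulary candidate actions
    plausibility: dict[tuple[str, str], float] = field(default_factory=dict)

    def ambiguous_pairs(self, threshold: float = 0.3) -> list[tuple[str, str]]:
        """Relations too uncertain to label without extra evidence."""
        return [pair for pair, score in self.plausibility.items() if score < threshold]

def select_tool(occluded: bool, detection_confidence: float) -> str:
    """Choose one of the three tools the abstract names for an ambiguous relation."""
    if occluded:
        return "diffusion"   # imagine a plausible view of the occluded interaction
    if detection_confidence < 0.5:
        return "crop"        # zoom into the region for enriched visual evidence
    return "retrieval"       # fetch domain-specific knowledge about the action
```

Presumably each tool's output is fed back into the map before the agent commits to a final (human, action, object) triplet, though the abstract does not spell out that control flow.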
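The abstract does not give the composite reward in closed form. One plausible instantiation, assuming a correctness indicator minus a linear per-call penalty weighted by lam, is sketched below; both the functional form and the default weight are assumptions, not the paper's definition.

```python
def composite_reward(prediction: str, ground_truth: str,
                     num_tool_calls: int, lam: float = 0.1) -> float:
    """Assumed form of the composite reward: +1 for a correct interaction
    label, minus a penalty per tool invocation to encourage efficiency.
    The paper's actual trade-off may differ."""
    accuracy = 1.0 if prediction == ground_truth else 0.0
    return accuracy - lam * num_tool_calls
```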