Mar 2, 2026 | arXiv:2603.01586

InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Yecong Wan, Fan Li, Chunwei Wang, Mingwen Shao, Wangmeng Zuo

AI Summary

The paper introduces InterCoG, a text-vision interleaved chain-of-grounding reasoning framework, to address the challenge of fine-grained image editing in complex, multi-entity scenes where targets are not visually salient. InterCoG performs object position reasoning in the text domain to deduce target location and identity, then grounds this reasoning visually using bounding boxes and masks, and finally refines the editing description. The approach is trained with multimodal grounding reconstruction supervision and reasoning alignment, and evaluated on a newly constructed GroundEdit-45K dataset, demonstrating improved spatial precision in complex editing scenarios.

Key Contribution

Achieves spatially precise image edits in complex scenes by explicitly reasoning about object positions in text before visual grounding.

Abstract

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
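The three stages described above (textual position reasoning, visual grounding with a box and mask, then a rewritten editing description) can be pictured with a minimal toy sketch. This is not the authors' implementation: in InterCoG these outputs are produced by a learned model, whereas here every name (`GroundingTrace`, `reason_about_position`, `box_to_mask`, `chain_of_grounding`), the entity dictionary, the supported relations, and the rectangular mask are hypothetical stand-ins chosen only to show how the stages hand results to one another.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in pixels, x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class GroundingTrace:
    reasoning: str            # stage 1: textual position reasoning
    box: Box                  # stage 2: bounding box in pixel space
    mask: List[List[int]]     # stage 2: binary mask (rectangular stand-in)
    refined_instruction: str  # stage 3: rewritten editing description


def reason_about_position(entities: Dict[str, Box], relation: str) -> Tuple[str, str]:
    """Toy stage 1: select the target by a spatial relation and explain why in text."""
    if relation == "leftmost":
        name = min(entities, key=lambda n: entities[n][0])
    elif relation == "rightmost":
        name = max(entities, key=lambda n: entities[n][2])
    else:
        raise ValueError(f"unsupported relation: {relation}")
    return name, f"Among {sorted(entities)}, '{name}' is the {relation} one."


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Toy stage 2: fill the box region with 1s; a real mask would follow the object."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def chain_of_grounding(
    entities: Dict[str, Box], relation: str, edit: str, height: int, width: int
) -> GroundingTrace:
    """Run the three toy stages in order and keep every intermediate result."""
    name, reasoning = reason_about_position(entities, relation)
    box = entities[name]
    mask = box_to_mask(box, height, width)
    refined = f"{edit} '{name}' inside box {box}; leave all other regions unchanged."
    return GroundingTrace(reasoning, box, mask, refined)


# Usage with two hypothetical entities on an 8x4 canvas.
entities = {"cup_a": (1, 1, 3, 3), "cup_b": (5, 1, 7, 3)}
trace = chain_of_grounding(entities, "leftmost", "Recolor to red:", height=4, width=8)
print(trace.reasoning)
print(trace.refined_instruction)
```

Keeping the reasoning text, box, mask, and refined instruction in one record mirrors the interleaved text-vision trace the abstract describes, and makes each intermediate step inspectable on its own.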

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
