BUPT, Tencent AI | Mar 2, 2026 | arXiv:2603.01893

Generative Visual Chain-of-Thought for Image Editing

Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Kongming Liang, Qinglin Lu, Zhanyu Ma

AI Summary

The paper introduces Generative Visual Chain-of-Thought (GVCoT), a framework for image editing that generates spatial cues to localize the target region before performing the edit. GVCoT jointly optimizes visual tokens during reasoning and editing, fostering spatial reasoning and enabling better use of visual cues. To train GVCoT, the authors created GVCoT-Edit-Instruct, a dataset of 1.8M samples, and used a progressive training strategy involving supervised fine-tuning and reinforcement learning; experiments on SREdit-Bench and ImgEdit show GVCoT outperforms existing methods.

Key Contribution

Forget brittle text-based reasoning: GVCoT unlocks more precise image editing by generating and optimizing visual reasoning cues directly within the image domain.

Abstract

Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes the visual tokens generated during the reasoning and editing phases in an end-to-end manner. This joint optimization fosters the emergence of innate spatial reasoning ability and enables more effective use of visual-domain cues. The main challenge in training GVCoT lies in the scarcity of large-scale editing data with precise edit-region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before the final edit, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope GVCoT will inspire future research toward interpretable and precise image editing.
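
The localize-then-edit order described in the abstract can be illustrated with a toy sketch. This is not the GVCoT model: in the paper both the spatial cue and the edited image are visual tokens generated by one unified model, whereas here a simple brightness threshold stands in for the reasoning step and a per-pixel function stands in for the edit. All names (`localize`, `edit_with_cue`, `localize_then_edit`) are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]   # toy grayscale image, values 0..255
Mask = List[List[bool]]   # True marks a pixel inside the target region


def localize(image: Image, threshold: int) -> Mask:
    """Stand-in for the visual reasoning step (assumption: the target
    region is simply every pixel brighter than `threshold`)."""
    return [[px > threshold for px in row] for row in image]


def edit_with_cue(image: Image, mask: Mask, edit_fn: Callable[[int], int]) -> Image:
    """Apply `edit_fn` only where the spatial cue is True; all other
    pixels are copied through unchanged."""
    return [
        [edit_fn(px) if inside else px for px, inside in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def localize_then_edit(
    image: Image, threshold: int, edit_fn: Callable[[int], int]
) -> Tuple[Mask, Image]:
    """Two-phase pipeline: produce the spatial cue first, then edit
    conditioned on it. Returning the cue keeps the step inspectable."""
    mask = localize(image, threshold)
    return mask, edit_with_cue(image, mask, edit_fn)


if __name__ == "__main__":
    image = [[10, 200], [220, 30]]
    mask, edited = localize_then_edit(image, threshold=128, edit_fn=lambda px: 0)
    print(mask)
    print(edited)
```

Returning the mask alongside the edited image mirrors the interpretability argument in the abstract: the intermediate cue can be checked independently of the final result.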

Computer Vision | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
