The paper introduces Generative Visual Chain-of-Thought (GVCoT), a framework for image editing that generates spatial cues to localize the target region before performing the edit. GVCoT jointly optimizes the visual tokens produced during reasoning and editing, which fosters spatial reasoning and enables better use of visual cues. To train GVCoT, the authors construct GVCoT-Edit-Instruct, a dataset of 1.8M samples spanning 19 tasks, and use a progressive training strategy: supervised fine-tuning followed by reinforcement learning. Experiments on SREdit-Bench and ImgEdit show that GVCoT outperforms existing methods.
Forget brittle text-based reasoning: GVCoT unlocks more precise image editing by generating and optimizing visual reasoning cues directly within the image domain.
Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes the visual tokens generated during the reasoning and editing phases in an end-to-end manner. This fosters the emergence of innate spatial reasoning ability and enables more effective use of visual-domain cues. The main challenge in training GVCoT is the scarcity of large-scale editing data with precise edit-region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before the final edit, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope GVCoT will inspire future research toward interpretable and precise image editing.
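The "localize first, then edit" ordering described in the abstract can be illustrated with a toy sketch. This is not the authors' method: in GVCoT the spatial cue is produced as visual tokens by the model itself and optimized jointly with the edit, whereas here the cue is a plain bounding box and the edit is composited through a hard mask. The names `box_to_mask`, `localize_then_edit`, `localize`, and `edit` are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image, row-major
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def box_to_mask(height: int, width: int, box: Box) -> List[List[bool]]:
    """Turn a bounding box (the stand-in spatial cue) into a boolean mask."""
    top, left, bottom, right = box
    return [
        [top <= r < bottom and left <= c < right for c in range(width)]
        for r in range(height)
    ]


def localize_then_edit(
    image: Image,
    localize: Callable[[Image], Box],  # hypothetical stage 1: where to edit
    edit: Callable[[Image], Image],    # hypothetical stage 2: what the edit looks like
) -> Image:
    """Apply `edit` only inside the region returned by `localize`."""
    height, width = len(image), len(image[0])
    mask = box_to_mask(height, width, localize(image))
    edited = edit(image)
    # Keep original pixels outside the localized region (assumed hard compositing).
    return [
        [edited[r][c] if mask[r][c] else image[r][c] for c in range(width)]
        for r in range(height)
    ]


# Toy usage: brighten only the top-left 2x2 region of a 3x3 black image.
source = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
result = localize_then_edit(
    source,
    localize=lambda img: (0, 0, 2, 2),
    edit=lambda img: [[255 for _ in row] for row in img],
)
print(result)  # [[255, 255, 0], [255, 255, 0], [0, 0, 0]]
```

The point of the sketch is only the ordering: an explicit spatial cue is produced before the edit and constrains where the edit lands, which also makes the intermediate step inspectable.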