SenseTime, Shandong Normal University, SJTU
May 26, 2026
arXiv:2605.26520

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

AI Summary

InterSketch, a new vision-language model, enhances visual-textual chain-of-thought reasoning by interleaving textual reasoning with dynamically generated visual sketches using external tools. To train InterSketch, the authors first create a synthetic high-quality interleaved visual-textual chain-of-thought dataset with a reflection mechanism for self-correction, and then employ a stepwise reward mechanism during reinforcement learning to address reward sparsity. Experiments on visual reasoning benchmarks show InterSketch outperforms existing models, including Gemini-3-Pro, demonstrating the effectiveness of interleaved reasoning with self-correcting visual sketches.

Key Contribution

InterSketch shows that interleaving visual sketches with textual reasoning, guided by self-correction and stepwise rewards, improves long-horizon visual reasoning; on the visual reasoning benchmarks reported, it outperforms proprietary models such as Gemini-3-Pro.

Abstract

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that enhances VT-CoT capability through self-correction and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning on long-horizon visual understanding tasks. Specifically, in the first (cold-start) stage, we synthesize a high-quality interleaved VT-CoT dataset and include a reflection mechanism that equips the model for multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.
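The abstract does not give the reward formula, so the following is only a minimal sketch of why stepwise rewards are denser than end-only supervision, not the authors' method. The `Step` class, the `step_reward` scores, the `step_weight` mixing factor, and the discount `gamma` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One turn of a hypothetical interleaved trajectory."""
    kind: str           # "text" or "sketch" (illustrative labels only)
    step_reward: float  # assumed per-step quality score in [0, 1]


def _returns_to_go(rewards: List[float], gamma: float) -> List[float]:
    """Discounted sum of rewards from each step to the end of the trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def end_only_returns(steps: List[Step], final_reward: float,
                     gamma: float = 1.0) -> List[float]:
    """End-only supervision: only the last step carries any reward."""
    rewards = [0.0] * len(steps)
    if rewards:
        rewards[-1] = final_reward
    return _returns_to_go(rewards, gamma)


def stepwise_returns(steps: List[Step], final_reward: float,
                     step_weight: float = 0.5,
                     gamma: float = 1.0) -> List[float]:
    """Stepwise variant (assumed form): each step adds a weighted score."""
    rewards = [step_weight * s.step_reward for s in steps]
    if rewards:
        rewards[-1] += final_reward
    return _returns_to_go(rewards, gamma)


if __name__ == "__main__":
    # A trajectory with a wrong final answer but two useful intermediate steps.
    trajectory = [Step("text", 1.0), Step("sketch", 0.0), Step("text", 1.0)]
    print(end_only_returns(trajectory, final_reward=0.0))  # no learning signal
    print(stepwise_returns(trajectory, final_reward=0.0))  # per-step signal
```

Under these assumptions, a trajectory that ends in a wrong answer yields an all-zero signal with end-only rewards, while the stepwise variant still credits the intermediate steps that scored well.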

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
