HIT; ShanghaiTech; University of Chinese Academy of Sciences
Feb 12, 2026
arXiv:2602.11731

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Jingxuan Wei, Honghao He, Caijun Jia, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan

AI Summary

The paper introduces Thinking with Drafting (TwD), a novel approach to visual reasoning that bridges the gap between optical perception and logical exactness in multimodal LLMs. TwD uses a minimalist Domain-Specific Language (DSL) as an intermediate representation to force the model to draft its mental model into executable code, generating deterministic visual proofs for self-verification. Evaluated on a new visual algebra benchmark, VisAlg, TwD demonstrates improved performance by serving as a superior cognitive scaffold for reasoning over visual inputs.

Key Contribution

Forcing models to "draft" their reasoning as executable code and render deterministic visual proofs improves visual reasoning on the VisAlg benchmark by bridging the gap between perception and logical structure.

Abstract

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
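The draft, render, and verify loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-statement bar-model language (`bar`, `sum`), the function names, and the ASCII rendering are hypothetical and are not the paper's DSL or renderer, which this page does not specify.

```python
from fractions import Fraction

# Hypothetical toy DSL (NOT the paper's DSL):
#   bar NAME COEF CONST   -> bar length is COEF * x + CONST
#   sum NAME... TOTAL     -> the named bars add up to TOTAL
DRAFT = """
bar A 1 0
bar B 3 0
sum A B 24
"""

def parse(draft):
    """Turn the drafted program into bars and constraints."""
    bars, constraints = {}, []
    for line in draft.strip().splitlines():
        tokens = line.split()
        if tokens[0] == "bar":
            bars[tokens[1]] = (Fraction(tokens[2]), Fraction(tokens[3]))
        elif tokens[0] == "sum":
            constraints.append((tokens[1:-1], Fraction(tokens[-1])))
        else:
            raise ValueError(f"unknown statement: {line!r}")
    return bars, constraints

def solve(bars, constraints):
    """Solve the first sum constraint for the single unknown x."""
    names, total = constraints[0]
    coef = sum(bars[n][0] for n in names)
    const = sum(bars[n][1] for n in names)
    if coef == 0:
        raise ValueError("constraint does not determine x")
    return (total - const) / coef

def render(bars, x):
    """Deterministically draw each bar as a row of '#' characters."""
    lengths = {n: c * x + k for n, (c, k) in bars.items()}
    rows = [f"{n}: " + "#" * int(length) + f" ({length})"
            for n, length in lengths.items()]
    return lengths, "\n".join(rows)

def verify(lengths, constraints):
    """Check the rendered lengths against every drafted constraint."""
    return all(sum(lengths[n] for n in names) == total
               for names, total in constraints)

bars, constraints = parse(DRAFT)
x = solve(bars, constraints)
lengths, picture = render(bars, x)
ok = verify(lengths, constraints)
print(picture)
print("verified:", ok)
```

In this sketch the rendered picture is a checkable artifact: if the drafted structure or the solved value were wrong, `verify` would return `False` rather than the error going unnoticed in a free-form answer.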

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
