PolyU, SJTU | Jun 8, 2026 | arXiv:2606.09585

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

AI Summary

This paper introduces optical reasoning, an approach that uses images as a standalone medium for reasoning in both language and multimodal tasks. Using typographic and graphical variants, the authors show that it can match or exceed text-based reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks. Overall, optical reasoning achieves 1.96 times the token efficiency of text reasoning.

Key Contribution

Images can serve as a powerful standalone medium for reasoning, achieving nearly double the token efficiency of traditional text methods.

Abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
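To put the reported reductions in perspective, the following minimal sketch converts a percentage reduction in reasoning tokens into a text-to-optical token ratio. The function name `compression_ratio` is hypothetical and not from the paper. The abstract does not define how the 1.96 times token efficiency is computed, so this arithmetic covers only the raw token counts and should not be read as the authors' metric.

```python
def compression_ratio(reduction_pct: float) -> float:
    """Return text_tokens / optical_tokens for a given percentage reduction.

    Hypothetical helper for illustration; not the paper's efficiency metric.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    # A reduction of r percent leaves (1 - r/100) of the original tokens.
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Reductions reported in the abstract.
language_ratio = compression_ratio(28.57)    # language tasks
multimodal_ratio = compression_ratio(16.0)   # multimodal tasks

print(f"language:   {language_ratio:.2f}x")    # language:   1.40x
print(f"multimodal: {multimodal_ratio:.2f}x")  # multimodal: 1.19x
```

Both ratios fall below 1.96, which suggests the reported token efficiency also accounts for something beyond token count alone, such as task accuracy. That reading is an assumption; the abstract does not spell it out.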

Topics: Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
