SJTU | Mar 3, 2026 | arXiv:2603.02872

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Jialiang Zhang, Junlong Tong, Junyan Lin, Yirong Sun, Yunpu Ma, Xiaoyu Shen

AI Summary

The paper introduces Think-as-You-See (TaYS), a novel framework for streaming chain-of-thought reasoning in large vision-language models (LVLMs) that addresses the limitations of batch-style processing for video streams. TaYS enables concurrent reasoning by integrating parallelized CoT generation, stream-constrained training, and stream-parallel inference, utilizing temporally aligned reasoning units and a dual KV-cache to decouple visual encoding from textual reasoning. Experiments on the Qwen2.5-VL family demonstrate that TaYS outperforms batch and interleaved baselines on video CoT tasks, improving reasoning performance and reducing time-to-first-token (TTFT) and overall reasoning delay.
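To make the TTFT claim concrete, here is a toy timing model, not taken from the paper. It assumes frames arrive at a fixed interval, that encoding a frame costs a fixed amount of time, and that a batch model cannot emit any token until every frame has arrived and been encoded, whereas a streaming model can emit its first token once the first frame is encoded. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def batch_ttft_ms(num_frames: int, frame_interval_ms: int, encode_ms: int) -> int:
    """Toy batch model: wait for the last frame, then encode all frames."""
    # The first frame arrives at t = 0, so the last one arrives after
    # (num_frames - 1) intervals.
    last_arrival_ms = (num_frames - 1) * frame_interval_ms
    return last_arrival_ms + num_frames * encode_ms


def streaming_ttft_ms(encode_ms: int) -> int:
    """Toy streaming model: the first token can follow the first encoded frame."""
    return encode_ms


if __name__ == "__main__":
    # Hypothetical setting: 60 frames, one every 500 ms, 20 ms to encode each.
    print(batch_ttft_ms(60, 500, 20))   # 30700
    print(streaming_ttft_ms(20))        # 20
```

Under these assumptions the batch delay grows with the length of the video, while the streaming delay does not. Real systems overlap arrival and encoding in more complicated ways, so the numbers only indicate the direction of the effect.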

Key Contribution

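LVLMs can reason about video streams with lower latency and better reasoning performance by thinking concurrently with the incoming frames instead of waiting for the full video, which reduces time-to-first-token (TTFT) and overall reasoning delay.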

Abstract

Large Vision-Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.
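The abstract names streaming attention masks and temporally aligned reasoning units without giving their exact form. The sketch below shows one plausible reading and should not be taken as the TaYS implementation. It assumes each token carries a time step, that a reasoning token may attend to frame tokens from its own or earlier time steps and causally to earlier reasoning tokens, and that frame tokens never attend to reasoning tokens (in the spirit of decoupling visual encoding from textual reasoning). The function `streaming_mask` and the token layout are hypothetical.

```python
from typing import List, Tuple

# (kind, time_step); kind is "frame" or "text". Hypothetical token layout.
Token = Tuple[str, int]


def streaming_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q, (q_kind, q_t) in enumerate(tokens):
        for k, (k_kind, k_t) in enumerate(tokens):
            if k_kind == "frame":
                # Frame keys are visible to any query at the same or a later step.
                mask[q][k] = k_t <= q_t
            elif q_kind == "text":
                # Text attends to text causally: earlier units, or the same
                # unit at an earlier-or-equal position in the sequence.
                mask[q][k] = k_t < q_t or (k_t == q_t and k <= q)
            # Frame query with text key stays False (assumed decoupling).
    return mask


if __name__ == "__main__":
    tokens = [("frame", 0), ("text", 0), ("frame", 1), ("text", 1)]
    for row in streaming_mask(tokens):
        print(["x" if allowed else "." for allowed in row])
```

In this toy layout the reasoning token for step 0 sees frame 0 but not frame 1, so reasoning about early frames does not depend on frames that have not yet arrived.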

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
