Qwen Team, Alibaba Inc.; LeapLab, Tsinghua University

Abstract

Vision-language models (VLMs) show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-
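To make the hop structure concrete, the following is a minimal sketch, not taken from the paper: this excerpt does not specify HopChain's data format or reward function, so every class name, field, and the exact-match parsing below are illustrative assumptions about how a logically dependent hop chain with a numeric, verifiable final answer could be represented and scored for RLVR.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Hop:
    """One step of a multi-hop query; later hops depend on instances grounded earlier."""
    question: str                                           # hypothetical sub-question text
    depends_on: list[int] = field(default_factory=list)     # indices of prerequisite hops

@dataclass
class HopChainItem:
    """A synthesized multi-hop query whose final answer is a single verifiable number."""
    image_id: str
    hops: list[Hop]
    final_answer: int                                        # unambiguous numeric target

def verifiable_reward(item: HopChainItem, model_output: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the model's last stated number matches exactly."""
    numbers = re.findall(r"-?\d+", model_output)              # illustrative answer parsing
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == item.final_answer else 0.0

# Example: hop 0 establishes the instance set, hop 1 reasons over it to reach a number.
item = HopChainItem(
    image_id="img_0001",
    hops=[
        Hop(question="Identify all vehicles parked on the left side of the street."),
        Hop(question="Of those vehicles, how many have their headlights on?", depends_on=[0]),
    ],
    final_answer=3,
)
print(verifiable_reward(item, "... so the answer is 3"))  # 1.0
```

Restricting the final answer to a single number, as the abstract describes, is what makes an exact-match reward like this well defined for RLVR, even though the intermediate hops require open-ended visual grounding.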
Figure: Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.