Jun 8, 2026 · arXiv:2606.09290

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan

AI Summary

This paper introduces Visual Para-Thinker++, a single-policy multi-agent framework for visual reasoning in which one shared MLLM policy acts as Main, Worker, and Summary Agents. Worker Agents reason in parallel and the Summary Agent reconciles their reasoning traces, which is intended to mitigate the early perceptual commitment and hallucination that single-chain reasoning is prone to. Across V*, CountBench, the RefCOCO family, and HallusionBench, the framework consistently outperforms single-trajectory and inference-time parallel baselines, with the strongest gains on hallucination-sensitive tasks.

Key Contribution

Visual Para-Thinker++ improves visual reasoning accuracy with a multi-agent architecture in which Worker Agents reason in parallel under context isolation and a Summary Agent reconciles their full reasoning traces, reducing hallucination relative to single-chain reasoning.

Abstract

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
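
The idea of assigning role-specific rewards and advantages to the corresponding token segments can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how advantages are computed, so the per-role, within-group reward normalization below is an assumption, and the names `Segment`, `role_advantages`, and `token_advantages` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Segment:
    role: str   # "main", "worker", or "summary" (hypothetical role labels)
    start: int  # inclusive token index
    end: int    # exclusive token index


def role_advantages(rewards_by_role, eps=1e-6):
    """Normalize rewards separately per role across a group of rollouts.

    ASSUMPTION: mean/std normalization within each role's reward group.
    The source only states that rewards and advantages are role-specific.
    """
    advantages = {}
    for role, rewards in rewards_by_role.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages[role] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


def token_advantages(segments, adv_for_rollout, num_tokens):
    """Broadcast each role's scalar advantage onto that role's tokens only."""
    per_token = [0.0] * num_tokens
    for seg in segments:
        for t in range(seg.start, seg.end):
            per_token[t] = adv_for_rollout[seg.role]
    return per_token


# Two rollouts of the same task, each with one reward per role (made-up values).
rewards = {"main": [1.0, 0.0], "worker": [0.0, 1.0], "summary": [1.0, 1.0]}
adv = role_advantages(rewards)

# Token layout of rollout 0: 2 Main tokens, 3 Worker tokens, 1 Summary token.
layout = [Segment("main", 0, 2), Segment("worker", 2, 5), Segment("summary", 5, 6)]
rollout0 = {role: values[0] for role, values in adv.items()}
per_token = token_advantages(layout, rollout0, num_tokens=6)
```

In this sketch, a rollout whose Main decomposition scored well but whose Worker reasoning scored poorly receives a positive advantage on its Main tokens and a negative one on its Worker tokens, so one role's outcome does not directly push the gradient of another role's segment.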

Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
