This paper introduces Visual Para-Thinker++, a single-policy multi-agent framework for visual reasoning that targets early perceptual commitment and hallucination. One shared MLLM policy is instantiated in three roles: a Main Agent that decomposes the task, Worker Agents that reason in parallel under context isolation, and a Summary Agent that reconciles the Workers' full reasoning traces. The paper reports consistent gains over single-trajectory and inference-time parallel baselines on V*, CountBench, the RefCOCO family, and HallusionBench, with especially strong gains on hallucination-sensitive tasks.
Visual Para-Thinker++ improves visual reasoning accuracy, especially on hallucination-prone tasks, by running one shared MLLM policy as parallel, role-conditioned Main, Worker, and Summary Agents.
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
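To make the idea of assigning role-specific rewards and advantages to token segments concrete, here is a minimal Python sketch. The abstract does not state the actual objective, so everything specific here is an assumption made for illustration: the group-mean baseline per role, the hypothetical function name `role_decoupled_advantages`, and the rollout layout (a per-token list of role labels plus one scalar reward per role). It is not the paper's implementation.

```python
from statistics import mean

# Role names taken from the abstract; the data layout below is hypothetical.
ROLES = ("main", "worker", "summary")


def role_decoupled_advantages(rollouts):
    """Return one per-token advantage list for each rollout in a group.

    Each rollout is a dict with:
      "roles":   list of role labels, one per generated token
      "rewards": dict mapping each role to that role's scalar reward

    Assumption: the baseline for a role is the mean reward of that role
    across the group of rollouts (a group-relative baseline).
    """
    # One baseline per role, computed only from that role's rewards.
    baseline = {
        role: mean(r["rewards"][role] for r in rollouts) for role in ROLES
    }

    per_token = []
    for r in rollouts:
        # Advantage of each role in this rollout, relative to its own baseline.
        adv = {role: r["rewards"][role] - baseline[role] for role in ROLES}
        # Every token receives the advantage of the role that produced it.
        per_token.append([adv[role] for role in r["roles"]])
    return per_token


# Two toy rollouts: the first has a weak Worker trace, the second a weak summary.
group = [
    {
        "roles": ["main", "worker", "worker", "summary"],
        "rewards": {"main": 1.0, "worker": 0.0, "summary": 1.0},
    },
    {
        "roles": ["main", "worker", "summary"],
        "rewards": {"main": 1.0, "worker": 1.0, "summary": 0.0},
    },
]
advs = role_decoupled_advantages(group)
```

In this sketch each token receives only its own role's advantage, so a poor Worker trace does not penalize the Main or Summary tokens of the same rollout. That is one plausible reading of how per-segment credit assignment could reduce gradient conflict among roles that share a policy.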