Mar 16, 2026 | arXiv:2603.14811

Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

AI Summary

The paper introduces Ego-to-World (E2W), a benchmark that evaluates how well vision-language models fuse heterogeneous viewpoints in embodied multi-agent systems, covering three tasks: global counting, relational location reasoning, and action-oriented grasping. To address this setting, the authors propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization, where a Cross-View Spatial Reward (CVSR) provides task-aligned feedback. Experiments on E2W show CoRL surpassing strong proprietary and open-source baselines on reasoning and perception-grounding metrics, and real-world multi-robot manipulation experiments demonstrate cross-view localization.

Key Contribution

A reinforcement learning approach that fuses the ego-centric viewpoints of multiple embodied agents into a world-centric scene understanding, enabling collaborative spatial reasoning and real-world multi-robot grasp-and-place manipulation.

Abstract

Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
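As a rough illustration of the reinforcement learning stage described above, the sketch below shows the group-relative advantage normalization commonly associated with Group-Relative Policy Optimization: several responses are sampled for the same prompt, and each reward is standardized against the mean and standard deviation of its own group. The abstract does not specify how the CVSR terms are computed or combined, so `combined_reward`, its three inputs, its equal weights, and all the scores are hypothetical placeholders rather than the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_reward(grounding, consistency, answer, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three reward terms, each assumed in [0, 1].

    The terms loosely mirror the three roles the abstract attributes to CVSR
    (visual grounding of reasoning steps, cross-view entity consistency,
    final-answer correctness); the actual reward design is an assumption here.
    """
    w_g, w_c, w_a = weights
    return w_g * grounding + w_c * consistency + w_a * answer


# Four sampled responses to one multi-view question (made-up scores).
group = [
    combined_reward(0.9, 1.0, 1.0),
    combined_reward(0.5, 0.5, 0.0),
    combined_reward(0.2, 0.0, 0.0),
    combined_reward(0.8, 1.0, 1.0),
]
advantages = group_relative_advantages(group)
```

Responses that score above their group's mean receive positive advantages and those below receive negative ones, so the policy update depends on relative quality within the group rather than on a separately learned value function.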

Eval Frameworks & Benchmarks | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
