Tsinghua AIAI Laboratory, NWPU, ★ Ropedia, Shanghai AI Lab, SJTU, Tencent AI, USTC, XJTU, ZJU
Jun 16, 2026
arXiv:2606.18239

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen

AI Summary

EBench is a simulation benchmark that evaluates generalist mobile manipulation policies on 26 tasks annotated along five capability dimensions and four generalization dimensions. Evaluating $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1 shows that models with similar success rates can have very different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, while InternVLA-A1 leads on mobile manipulation but collapses on dexterous tasks. The benchmark also analyzes generalization under different distribution shift factors, providing diagnostic signals for improving generalist manipulation models.

Key Contribution

Models with similar success rates can have drastically different capability profiles, revealing hidden strengths and weaknesses in mobile manipulation tasks.

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
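To make the "beyond a single success-rate scalar" idea concrete, here is a minimal sketch of how per-capability success rates could be aggregated from annotated episodes. Everything in it is an assumption for illustration: the function names, the capability labels "mobile" and "dexterous" (placeholders, not EBench's actual dimension names), the toy episode data, and the definition of retention as test success divided by train success. It is not the benchmark's own scoring code.

```python
from collections import defaultdict


def overall_success(episodes):
    """Fraction of successful episodes: the single scalar."""
    return sum(int(success) for _, success in episodes) / len(episodes)


def capability_profile(episodes):
    """Success rate per capability tag.

    `episodes` is a list of (capability_tags, success) pairs, where
    capability_tags is a set of hypothetical capability labels.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tags, success in episodes:
        for tag in tags:
            totals[tag] += 1
            wins[tag] += int(success)
    return {tag: wins[tag] / totals[tag] for tag in totals}


def retention(train_rate, test_rate):
    """Assumed definition: test success as a fraction of train success."""
    return test_rate / train_rate if train_rate > 0 else 0.0


# Toy data invented for illustration; these are not EBench results.
policy_a = [({"mobile"}, True), ({"mobile"}, True),
            ({"dexterous"}, False), ({"dexterous"}, False)]
policy_b = [({"mobile"}, True), ({"mobile"}, False),
            ({"dexterous"}, True), ({"dexterous"}, False)]

for name, episodes in [("A", policy_a), ("B", policy_b)]:
    print(name, overall_success(episodes), capability_profile(episodes))
```

In this toy setup both policies share the same overall success rate, yet one succeeds only on the "mobile" episodes while the other is evenly split. That is the kind of difference a per-dimension profile exposes and a single scalar hides.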

Eval Frameworks & Benchmarks | Robotics & Embodied AI

