UChicago, USC | May 26, 2026 | arXiv:2605.27759

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh

AI Summary

Colosseum V2, a large-scale simulation benchmark, is introduced to evaluate the generalization capabilities of Vision-Language-Action (VLA) models in robotic manipulation across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, built on the ManiSkill simulator for fast, GPU-parallelized evaluation. Evaluations of state-of-the-art methods like ACT and Pi0.5 reveal limitations in both base performance and generalization, highlighting the need for more robust translation of high-level understanding into behavior.

Key Contribution

VLA models may seem impressive, but Colosseum V2 reveals their robotic manipulation performance often crumbles under distribution shifts.

Abstract

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
