Mar 16, 2026 · arXiv:2603.15432

Gym-V: A Unified Vision Environment System for Agentic Vision Research

Fanqing Meng Lingxiao Du Jiawei Gu Jiaqi Liao Linjie Li Zijian Wu Xiangyan Liu Ziqi Zhao Mengkang Hu Yue Zhang Zichen Liu Jiaheng Zhang Michael Qizhe Shieh

AI Summary

Gym-V is introduced as a unified platform comprising 179 procedurally generated visual environments across 10 domains, designed to facilitate controlled experiments in agentic vision research. The study reveals that observation scaffolding, such as captions and game rules, significantly impacts training success, outweighing the choice of RL algorithm. Furthermore, the platform demonstrates that training on diverse task categories promotes broad generalization, while narrow training can lead to negative transfer effects, especially with multi-turn interactions.

Key Contribution

Gym-V shows that observation scaffolding (captions and game rules) matters more for training successful vision agents than the choice of RL algorithm.

Abstract

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
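To make the terms in the abstract concrete, here is a minimal sketch of a gym-style visual environment with procedural generation, controllable difficulty, a verifiable reward, and optional observation scaffolding. It is an illustration only, not Gym-V's actual API: the class `CountCellsEnv`, the `Observation` fields, and the parameters `difficulty`, `with_caption`, and `with_rules` are all hypothetical names chosen for this example, and a grid of 0/1 cells stands in for a rendered image.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    image: List[List[int]]   # grid of 0/1 cells standing in for a rendered image
    caption: Optional[str]   # optional text description of the image (scaffolding)
    rules: Optional[str]     # optional statement of the task rules (scaffolding)


class CountCellsEnv:
    """Hypothetical single-turn task: count the filled cells in a random grid."""

    RULES = "Reply with the number of cells equal to 1."

    def __init__(self, difficulty=1, with_caption=False, with_rules=False, seed=0):
        self.size = 2 + difficulty          # difficulty controls the grid size
        self.with_caption = with_caption
        self.with_rules = with_rules
        self.rng = random.Random(seed)      # seeded for reproducible generation
        self.answer = None

    def reset(self):
        # Procedurally generate a fresh instance and record its ground truth.
        n = self.size
        grid = [[self.rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
        self.answer = sum(map(sum, grid))
        caption = f"A {n}x{n} grid of 0/1 cells." if self.with_caption else None
        rules = self.RULES if self.with_rules else None
        return Observation(grid, caption, rules)

    def step(self, action):
        # Verifiable reward: 1.0 only when the answer matches exactly.
        reward = 1.0 if action == self.answer else 0.0
        done = True                          # single-turn episode
        return reward, done
```

Toggling `with_caption` and `with_rules` while holding the task and difficulty fixed is the kind of controlled comparison the abstract describes; a multi-turn variant would return `done = False` until a turn limit or success condition is reached.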

Computer Vision, Robotics & Embodied AI, Tool Use & Agents

Year: 2026

Venue: N/A
