Gym-V is a unified platform of 179 procedurally generated visual environments across 10 domains, designed to support controlled experiments in agentic vision research. The study finds that observation scaffolding, such as captions and game rules, matters more for training success than the choice of RL algorithm. Training on diverse task categories promotes broad generalization, while narrow training can cause negative transfer, and multi-turn interaction amplifies all of these effects.
Forget fancy RL algorithms: Gym-V finds that good observation scaffolding (captions, game rules) matters more for training successful vision agents.
As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
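To make the idea of observation scaffolding concrete, here is a minimal sketch of a gym-style environment whose observations can optionally carry a caption and the task rules. It is an illustration only and does not reflect the actual Gym-V API: the names `ToyCountEnv`, `Observation`, `with_caption`, and `with_rules` are hypothetical, the nested list stands in for an image, and the counting task is invented for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    """One observation; caption and rules are the optional scaffolding."""
    grid: List[List[int]]      # stand-in for an image: rows of 0/1 cells
    caption: Optional[str]     # textual description of the visual state
    rules: Optional[str]       # task rules shown to the agent


class ToyCountEnv:
    """Hypothetical toy environment (NOT the Gym-V API)."""

    RULES = "Answer with the number of filled (1) cells in the grid."

    def __init__(self, difficulty: int = 1, with_caption: bool = True,
                 with_rules: bool = True) -> None:
        self.size = 2 + difficulty          # difficulty controls grid size
        self.with_caption = with_caption
        self.with_rules = with_rules
        self.grid: List[List[int]] = []
        self.answer = 0

    def reset(self, seed: int) -> Observation:
        # Procedural generation: the same seed always yields the same grid.
        rng = random.Random(seed)
        self.grid = [[rng.randint(0, 1) for _ in range(self.size)]
                     for _ in range(self.size)]
        self.answer = sum(sum(row) for row in self.grid)
        return self._observe()

    def step(self, action: int) -> Tuple[Observation, float, bool]:
        # Verifiable reward: 1.0 only for the exact count; single-turn episode.
        reward = 1.0 if action == self.answer else 0.0
        return self._observe(), reward, True

    def _observe(self) -> Observation:
        caption = None
        if self.with_caption:
            rows = " / ".join("".join(map(str, row)) for row in self.grid)
            caption = f"A {self.size}x{self.size} grid with rows: {rows}"
        rules = self.RULES if self.with_rules else None
        return Observation(self.grid, caption, rules)


if __name__ == "__main__":
    # Same task instance with and without scaffolding.
    for scaffold in (True, False):
        env = ToyCountEnv(difficulty=2, with_caption=scaffold, with_rules=scaffold)
        obs = env.reset(seed=0)
        _, reward, done = env.step(env.answer)
        print(scaffold, obs.caption, obs.rules, reward, done)
```

Holding the seed and difficulty fixed while toggling the two flags is the kind of controlled comparison the abstract describes: the underlying task and reward are identical, and only the information exposed to the agent changes.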