The paper introduces UNIVERSE, a VLM-based evaluator that addresses the challenge of evaluating video world model rollouts by assessing action alignment and semantic consistency. The authors adapt VLMs under data and compute constraints using full, partial, and parameter-efficient methods across various task formats and environments. The resulting UNIVERSE evaluator achieves parity with task-specific checkpoints and aligns strongly with human judgments on action and character recognition tasks.
VLMs can be effectively adapted, even under data and compute constraints, to create a unified evaluator for video world models that rivals task-specific models and aligns well with human judgment.
World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency - capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks - action recognition and character recognition - each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.