The paper introduces UNIVERSE, a VLM-based evaluator that addresses the challenge of evaluating video world model rollouts by assessing action alignment and semantic consistency. The authors adapt VLMs under data and compute constraints using full, partial, and parameter-efficient methods across various task formats and environments. The resulting UNIVERSE evaluator achieves parity with task-specific checkpoints and aligns strongly with human judgments on action and character recognition tasks.
VLMs can be effectively adapted, even under data and compute constraints, to create a unified evaluator for video world models that rivals task-specific models and aligns well with human judgment.
World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency - capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks - action recognition and character recognition - each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.