Fudan · Independent researchers · *Equally · Meituan · May 25, 2026 · arXiv:2605.25874

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding

AI Summary

The authors introduce WBench, a new multi-turn benchmark for evaluating interactive video world models across five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench comprises 289 test cases with 1,058 interaction turns, covering diverse scenes and interaction types, and unifies different control interfaces for navigation. Evaluation leverages 22 automatic sub-metrics validated against human judgments, revealing that no single model excels across all dimensions and providing diagnostic insights into model strengths and weaknesses.

Key Contribution

Interactive world models still have a long way to go: a comprehensive benchmark reveals that even state-of-the-art models struggle to consistently perform well across video quality, interaction adherence, and physics compliance.

Abstract

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
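The case structure described in the abstract (a world setting plus a multi-turn interaction sequence, four interaction types, and three navigation control interfaces) can be pictured with a small schema. The following is only an illustrative sketch: the class names, field names, and example values are hypothetical and are not taken from the WBench repository, whose actual data format may differ. The only figures drawn from the abstract are the five dimensions, the four interaction types, the three navigation controls, and the counts of 289 cases and 1,058 turns.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InteractionType(Enum):
    """The four interaction types named in the abstract."""
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class NavigationControl(Enum):
    """The three navigation interfaces the abstract says are unified."""
    TEXT = "text"
    POSE_6DOF = "pose_6dof"
    DISCRETE_ACTION = "discrete_action"


# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physics_compliance",
)


@dataclass
class Turn:
    """One interaction turn (hypothetical fields, for illustration only)."""
    interaction: InteractionType
    instruction: str
    # Only meaningful for navigation turns; None otherwise.
    control: Optional[NavigationControl] = None


@dataclass
class BenchmarkCase:
    """A world setting plus its multi-turn interaction sequence."""
    world_setting: str
    perspective: str  # "first_person" or "third_person"
    turns: List[Turn] = field(default_factory=list)


def mean_turns_per_case(num_turns: int, num_cases: int) -> float:
    """Average number of interaction turns per test case."""
    return num_turns / num_cases


# A made-up example case; the content is invented for illustration.
example = BenchmarkCase(
    world_setting="A snowy mountain village at dusk",
    perspective="first_person",
    turns=[
        Turn(InteractionType.NAVIGATION, "Walk forward along the street",
             control=NavigationControl.TEXT),
        Turn(InteractionType.EVENT_EDITING, "It starts to snow heavily"),
        Turn(InteractionType.PERSPECTIVE_SWITCHING, "Switch to third person"),
    ],
)

# Counts reported in the abstract: 1,058 turns over 289 cases.
average = mean_turns_per_case(1058, 289)
print(f"{len(example.turns)} turns in example; benchmark mean = {average:.2f}")
```

With the reported counts, the benchmark averages roughly 3.66 interaction turns per test case.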

Eval Frameworks & Benchmarks · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
