Xuezhi Cao

Rankings of long-context LLMs shift substantially when models are evaluated across a range of context lengths and capabilities, which shows that a single headline score can be misleading.

Deli Huang, Cunguang Wang, Hongyin Tang +14

Eval Frameworks & Benchmarks · Natural Language Processing · Recommendation & Information Retrieval

May 25, 2026

2w ago · also: Independent researchers, Meituan

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Interactive world models still have a long way to go: a comprehensive benchmark reveals that even state-of-the-art models struggle to consistently perform well across video quality, interaction adherence, and physics compliance.

Kaining Ying, Hengrui Hu, Siyu Ren +6

Eval Frameworks & Benchmarks · Multimodal Models · World Models & Planning

Papers (3)