Ziwen Wang

Long-context LLM rankings dramatically reshuffle when evaluated across a range of context lengths and capabilities, proving that a single headline score is misleading.

Deli Huang, Cunguang Wang, Hongyin Tang +12

Eval Frameworks & Benchmarks · Natural Language Processing · Recommendation & Information Retrieval

May 25, 2026·also Meituan

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Interactive world models still have a long way to go: a comprehensive benchmark reveals that even state-of-the-art models struggle to consistently perform well across video quality, interaction adherence, and physics compliance.

Kaining Ying, Hengrui Hu, Siyu Ren +5

Eval Frameworks & Benchmarks · Multimodal Models · World Models & Planning

Apr 13, 2026·also Meituan, XJU

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

LLMs that ace math and physics still struggle with general reasoning, achieving only 63% accuracy on a new K-12-level benchmark.

Shengnan An, Shuang Zhou, Dan Ma +5

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought