Yuchen Li

Achieving new state-of-the-art scores on deep research benchmarks, DuMate-DeepResearch redefines the capabilities of multi-agent systems in tackling complex research tasks.

Lingyong Yan, Yukun Zhao, Wenxuan Li +12

Reasoning & Chain-of-Thought · Scalable Oversight & Alignment Theory · Tool Use & Agents

Jun 1, 2026

Tsinghua AI · Jun 1, 2026 · also CAS, PKU

Joint Agent Memory and Exploration Learning via Novelty Signals

Novelty-driven interaction enables agents to explore more effectively while using memory efficiently, outperforming traditional methods in open-ended environments.

Shizuo Tian, Xiaohong Weng, Rui Kong +8

Tool Use & Agents · World Models & Planning

May 28, 2026

Jiamin Chen +7 · May 28, 2026 · also CAS, HKU

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

Long-form video generation struggles with transitions: transition quality scores only 0.256 even when prompt fulfillment is high (0.71), a critical bottleneck exposed by the new DirectorBench diagnostic benchmark.

Jiamin Chen, Qianben Chen, Jiawen Zhang +5

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Jiamin Chen +8 · May 28, 2026 · also CAS, HKU

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

Saturated LLM benchmarks can be revived without creating new datasets: a self-improving LLM judge in an elimination tournament recovers ranking signal and breaks ties.

Jiamin Chen, Yidi Wu, Qiexiang Wang +6

Eval Frameworks & Benchmarks · Natural Language Processing

May 27, 2026

Lusha Wang +3 · May 27, 2026 · also CAS

Chinese Word Boundary Recovery through Character Alignment Projection

Alignment-based projection offers a surprisingly effective way to fix broken Chinese word boundaries in noisy text, outperforming direct segmentation and stabilizing annotation pipelines.

Lusha Wang, Yuchen Li, Su Yuan +1

Data Curation & Synthetic Data · Natural Language Processing