May 11 – May 18, 2026

Eval Frameworks & Benchmarks - Weekly Roundup

4 papers published across 1 lab.

3600% acceleration

Selected Labs publishing this week

Mila (1)

Top Papers

May 18, 2026

Mila · 1w ago · also CIFAR, McGill, ServiceNow

Forecasting Downstream Performance of LLMs With Proxy Metrics

Forget expensive downstream evaluations: token-level statistics from expert-written solutions can reliably forecast LLM performance with 10,000x less compute.

Arkil Patel, Siva Reddy, Marius Mosbach +1

Eval Frameworks & Benchmarks · Scaling Laws & Emergent Abilities · Training Efficiency & Optimization

1w ago · also Tencent AI

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Current video understanding models struggle with long-horizon robustness and non-speech audio, as revealed by the new OmniPro benchmark designed for comprehensive omni-modal proactive evaluation.

Ruixiang Zhao, Jie Yang, Zijie Xin +4

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models · +1

May 15, 2026

Jinuk Kim +4 · 1w ago

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

LLMs can now be rigorously tested on their ability to generate correct chip design rule checking (DRC) scripts, thanks to a new benchmark that scores scripts based on execution, not just code similarity.

Jinuk Kim, J.S. Byun, Donghwi Hwang +2

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

May 14, 2026

Haoran Zhang +13 · 1w ago

$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Current personal assistant agents struggle to anticipate and act on unstated user needs in long, complex workflows, revealing a critical gap between task completion and genuine proactivity.

Haoran Zhang, Luxin Xu, Zhilin Wang +11

Eval Frameworks & Benchmarks · Natural Language Processing · Tool Use & Agents
