Allen Institute for AI (AI2)

Eval Frameworks & Benchmarks

6 papers from Allen Institute for AI (AI2) on Eval Frameworks & Benchmarks

Apr 29, 2026 · AI2

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

LLMs often withhold helpful information because they misinterpret user intent. Multi-turn conversations can recover that utility, but they introduce new failure modes such as "utility lock-in" and "unsafe recovery" that single-turn benchmarks miss.

Mingqian Zheng, Malia Morgan, Liwei Jiang +2

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Apr 14, 2026 · also AI2, HUJI, Technion

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.

Asaf Yehudai, Yotam Perlitz, Elron Bandel +2

Eval Frameworks & Benchmarks · Training Efficiency & Optimization

Mar 17, 2026 · AI2 · also Alongside.care

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Synthetic benchmarks can't catch the nuances of personalized deep research, as real users revealed nine critical errors that LLM judges missed entirely.

Nishant Balepur, Malachi Hamada +14

Eval Frameworks & Benchmarks · Natural Language Processing · Recommendation & Information Retrieval

Mar 2, 2026 · AI2 · also UW, Fred Hutchinson Cancer Center, Independent Researcher, Pancreatic Cancer Action Network

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

LLMs still struggle with factual accuracy in specialized medical domains like pancreatic cancer, with hallucination rates varying wildly and web search integration failing to guarantee better responses.

Scott Geng, Fatima Zelada-arenas, Alejandra Alvarez +6

Eval Frameworks & Benchmarks · Natural Language Processing · Scientific Discovery & Drug Design

Feb 11, 2026 · AI2 · also CMU ML, NVIDIA, UW

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Forget synthetic benchmarks that don't translate: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.

Wilbert Pumacay, Omar Rayyan, Max Argus +22

Eval Frameworks & Benchmarks · Robotics & Embodied AI · World Models & Planning

Jun 2, 2025 · AI2 · also UW

RewardBench 2: Advancing Reward Model Evaluation

RewardBench 2 is a reality check for reward models: they struggle significantly on new, human-generated prompts, yet performance on this harder benchmark is predictive of their usefulness in downstream tasks.

Saumya Malik, Valentina Pyatkin, Sander Land +4

Eval Frameworks & Benchmarks · RLHF & Preference Learning
