26 papers from Stanford HAI on Eval Frameworks & Benchmarks
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but the discrepancies reveal more about clinical workflow gaps than about AI errors.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
LLM agents struggle to maintain coherent decision-making in realistic retail environments over long horizons, even with a novel framework for adaptive strategy evolution.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, with success rates dropping from 80% to 38% on a new benchmark.
Despite excelling at speech recognition, current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance.
AI can generate realistic legal questions, but current models still suffer from limited diversity and a tendency to agree too readily, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Guaranteeing reductions in harm from biased LLM judges is now possible, even when the biases are unknown or adversarially discovered.
Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Turns out, the best memory design for robotic manipulation depends heavily on the task, with no single architecture dominating across the board.
Forget OCR? Powerful MLLMs can extract information from business documents just as well from images alone as from OCR-extracted text, challenging the necessity of traditional OCR pipelines.
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
An interactive AI can fairly evaluate skills across diverse self-presentation styles, ensuring equitable outcomes even when individuals differ in their tendency towards self-promotion or modesty.
LLMs struggle to explore multiple valid reasoning paths, often committing to a single route and missing alternative solutions, especially in complex, multi-step logical problems.
LLMs may grasp the broad strokes of causal strategies, but struggle with the devilish details of research design, as revealed by a new benchmark separating causal identification from estimation.
Airavat automates expert-level Internet measurement, catching methodological flaws that traditional testing misses.
LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.
Language model capabilities are surprisingly stable over time for most tasks, with math reasoning the one area that keeps advancing, which makes it possible to reliably translate compute budgets into performance expectations.
A clinical reasoning system using curated evidence beats GPT-5 on endocrinology board exams, suggesting that domain-specific knowledge can outweigh raw LLM scale in specialized fields.
Despite progress in AI safety, it's still largely unknown how well current safeguards prevent AI harms, and their measured effectiveness varies wildly.
A fine-tuned open-source Mistral-7B model rivals GPT-4 Turbo in extracting clinical history elements from imaging orders, offering a cost-effective and accurate alternative for assessing clinical history completeness.