CMU Machine Learning

28 papers from CMU Machine Learning on Eval Frameworks & Benchmarks

Apr 28, 2026

CMU ML·3w ago·also NVIDIA, Georgia Tech, Princeton

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

Existing robotic methods falter in tackling fundamental physical reasoning challenges, as evidenced by KinDER's rigorous benchmark evaluation.

Yixuan Huang, Bowen Li, Vaibhav Saxena +12

Eval Frameworks & Benchmarks, Robotics & Embodied AI, World Models & Planning

Apr 27, 2026

CMU ML·3w ago

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Today's best web agents are shockingly inefficient, achieving only 1.15% trajectory efficiency on realistic long-horizon tasks, revealing a critical need to move beyond simple success rates.

Lawrence Keunho Jang, Jing Yu Koh +5

Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 22, 2026

CMU ML·Apr 22, 2026

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Continual learning for LLM agents hits a wall: scaling models doesn't reliably improve skill generation, and self-feedback can lead to recursive drift.

Shanshan Zhong, Shan Zhong, Yiming Lu +17

Eval Frameworks & Benchmarks, Robotics & Embodied AI, Tool Use & Agents

Apr 19, 2026

CMU ML·Apr 19, 2026·also Microsoft Research

HORIZON: A Benchmark for In-the-wild User Behaviour Modeling

Current user modeling benchmarks are child's play compared to the real-world challenges exposed by HORIZON, a massive new dataset spanning 54M users and diverse domains.

Pranjal A Chitale, Bhawna Paliwal, Bishal Santra +1

Eval Frameworks & Benchmarks, Recommendation & Information Retrieval

CMU ML·Apr 19, 2026·also Fewshot Corp, Independent Researcher

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Frontier LLMs are surprisingly vulnerable to a wide range of task-specific exploits, from simple output spoofing to rootkit-style binary hijacking, even in seemingly well-defined environments.

Ivan Bercovich, Ivgeni Segal +4

Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

Apr 16, 2026

CMU ML·Apr 16, 2026·also Max Planck, UofT, Vector

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Forget carrots and sticks: contracts and mediation are the surprisingly effective keys to unlocking cooperation between LLMs, even when individual incentives push toward defection.

Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita +2

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 16, 2026·also CMU ML

Scaling Test-Time Compute for Agentic Coding

Agentic coding gets a serious boost: distilling and reusing rollout trajectories lets Claude-4.5-Opus jump from 70.9% to 77.6% on SWE-Bench Verified.

Joongwon Kim, Wannan Yang, Kelvin Niu +13

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Scaling Laws & Emergent Abilities +1

Apr 15, 2026

CMU ML·Apr 15, 2026·also UMass

Evaluation of Agents under Simulated AI Marketplace Dynamics

Stop evaluating AI systems in isolation: marketplace dynamics like user switching and early-adoption advantages critically shape real-world success.

To Eun Kim, Alireza Salemi, Hamed Zamani +1

Eval Frameworks & Benchmarks, Recommendation & Information Retrieval, Tool Use & Agents

CMU ML·Apr 15, 2026

Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

LLMs can mimic human writing, but not as well as you think: genre matters more than the source (human vs. LLM), and model choice trumps decoding strategy when it comes to style.

Swati Rallapalli, Shannon Gallagher +11

Eval Frameworks & Benchmarks, Interpretability & Mechanistic Interp, Natural Language Processing

Apr 14, 2026

CMU ML·Apr 14, 2026

When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Data augmentation with LLMs can tank your NER performance even when it boosts POS tagging, proving task structure matters more than synthetic data quality.

Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing

Apr 13, 2026

Apr 13, 2026·also CMU ML, Tsinghua AI

DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

DPO might not be the only game in town: a decision-directed approach to reward modeling can outperform it in pairwise preference optimization.

Tiantian Zhang, Jierui Zuo

Eval Frameworks & Benchmarks, RLHF & Preference Learning

CMU ML·Apr 13, 2026

Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Interpretability methods often fail to improve over black-box prompting when models are uncooperative, suggesting current techniques may be more about elicitation than revealing internal mechanisms.

Aashiq Muhamed, Mona T. Diab, Virginia Smith

Eval Frameworks & Benchmarks, Interpretability & Mechanistic Interp

CMU ML·Apr 13, 2026·also Northwestern, Stony Brook, Yale

Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

SAM models exhibit surprisingly divergent behaviors under occlusion, with some prioritizing visible tissue and others confidently hallucinating hidden anatomy.

Nhan Ho, Luu Le, Thanh-Huy Nguyen +2

Computer Vision, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

Apr 12, 2026

CMU ML·Apr 12, 2026·also UW-Madison

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Stop reimplementing multimodal models: TorchUMM offers a unified codebase for evaluation, analysis, and post-training, streamlining research across diverse architectures and tasks.

Yinyi Luo, Wenwen Wang, Hayes Bai +5

Eval Frameworks & Benchmarks, Multimodal Models, Open-Source Models & Weights

Apr 9, 2026

CMU ML·Apr 9, 2026·also Tsinghua AI, Waterloo

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Today's best AI agents can only complete 33% of common online tasks like booking appointments or filling out job applications, revealing a significant gap between current capabilities and real-world utility.

Yuxuan Zhang, Yubo Wang, Yipeng Zhu +19

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Apr 7, 2026

CMU ML·Apr 7, 2026·also Stanford HAI, Harvard, Institute for the Study of Natural and Artificial, SJTU

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

LLMs struggle to synthesize scientific conclusions from structured biomedical evidence, and current metrics fail to capture nuanced differences in their reasoning abilities.

Weiyue Li, Ruizhi Qian +7

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing +1

Apr 6, 2026

CMU ML·Apr 6, 2026·also EuroSafeAI, Max Planck, UofT, Vector

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

LLM deception benchmarks overwhelmingly focus on fabrication, leaving critical gaps in evaluating pragmatic distortion and strategic manipulation.

Jerick Shi, Terry Jingchen Zhang +2

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

CMU ML·Apr 6, 2026·also MIT CSAIL, Oxford, University of California

AI Assistance Reduces Persistence and Hurts Independent Performance

Just 10 minutes of AI assistance can measurably degrade your ability to solve problems on your own.

Grace Liu, Brian Christian, Tsvetomira Dumbalska +2

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Apr 1, 2026

CMU ML·Apr 1, 2026

Do Agents Repair When Challenged -- or Just Reply? Challenge, Repair, and Public Correction in a Deployed Agent Forum

LLM-powered forums may generate norm-aware language, but they fail to foster the crucial back-and-forth needed for communities to teach, enforce, and revise those norms.

Luyang Zhang, Yi-Yun Chu, Jialu Wang +2

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Tool Use & Agents

Mar 10, 2026

Google Research·Mar 10, 2026·also CMU ML, DeepMind, Harvard

Think Before You Lie: How Reasoning Improves Honesty

LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.

Ann Yuan, Asma Ghandeharioun, Carter Blum +5

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought

Mar 5, 2026

CMU ML·Mar 5, 2026

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Finally, a standardized benchmark for survival analysis HTE estimation lets you rigorously compare methods across synthetic, semi-synthetic, and real-world datasets.

Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss +1

Eval Frameworks & Benchmarks, Scientific Discovery & Drug Design

Mar 4, 2026

CMU ML·Mar 4, 2026·also BAIR, MIT CSAIL, NVIDIA, Tsinghua AI +11

ManipulationNet: An Infrastructure for Benchmarking Real-World Robot Manipulation with Physical Skill Challenges and Embodied Multimodal Reasoning

Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.

Kenneth Kimble, Edward H. Adelson +23

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Mar 2, 2026

CMU ML·Mar 2, 2026·also Independent Researcher

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Today's frontier LLMs can't autonomously patch critical zero-day vulnerabilities, revealing a significant gap in their cyberdefense capabilities.

Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin +5

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Feb 17, 2026

CMU ML·Feb 17, 2026·also UMich, UofT

Transforming GenAI Policy to Prompting Instruction: An RCT of Scalable Prompting Interventions in a CS1 Course

Want to boost student performance in the age of GenAI? This RCT finds that scalable prompting interventions, grounded in the ICAP framework, can significantly improve student prompting skills and, ultimately, exam scores.

Ruiwei Xiao, Runlong Ye, Xinying Hou +4

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing

CMU ML·Feb 17, 2026

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

MLLMs struggle with multi-turn chart editing, forgetting context and accumulating errors, especially when the edits involve data transformations, not just styling.

Manav Nitin Kapadnis, Lawanya Baghel, Carolyn Rosé

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Multimodal Models

Feb 16, 2026

IIT·Feb 16, 2026·also CMU ML

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

VLMs that ace RGB images completely fail at thermal imagery, revealing a critical gap in their ability to reason about temperature and physical properties.

Ayush Shrivastava, Kirtan Gangani, Laksh Jain +2

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Feb 11, 2026

CMU ML·Feb 11, 2026·also Princeton

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Multimodal agents still struggle with game development, solving only ~50% of tasks in a new benchmark, GameDevBench, highlighting the need for better multimodal reasoning in complex software environments.

Wayne Chi, Yixiong Fang +16

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Jan 6, 2026

CMU ML·Jan 6, 2026

AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

LLMs' impressive general knowledge evaporates when faced with African economic data, as even advanced RAG pipelines struggle to answer questions based on World Bank reports, revealing a stark domain-specific knowledge gap.

Edward Ajayi

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing
