Using a top- or bottom-performing LLM as the anchor in "LLM-as-a-judge" benchmarks can dramatically skew results; choosing a mid-performing anchor turns out to be key to reliable evaluation.
Particle filter models of sentence processing inherently predict "digging-in" effects—where disambiguation difficulty increases with the length of the ambiguous region—a phenomenon not captured by surprisal-based models.
Fine-tuning unlocks LLMs' surprising ability to predict how memorable a sentence is and how long it takes to read, outperforming traditional methods.
Building a complete web application from scratch remains a surprisingly hard task for even the best AI models, with top performance at only 58% accuracy on a new end-to-end benchmark.
LLMs that ace static code-fixing benchmarks may still struggle to maintain code quality over the long, iterative haul of real-world software development.
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Predict how well your LLM will transfer to a new domain *before* fine-tuning, by using sparse autoencoders to spot tell-tale signs of domain shift in the model's representations.
LLMs struggle to reliably predict numerical materials properties, even after fine-tuning, and their performance fluctuates wildly over time, casting doubt on their use in high-stakes scientific applications.
Agentic AI can automate complex optical systems control with near-perfect success rates, leaving code-generation approaches in the dust.
Randomly initialized encoders can match state-of-the-art pre-trained models on many ECG representation learning tasks, suggesting current benchmarks are misleading.
VLMs are nowhere near human-level general intelligence: they score less than 10% of human performance across a diverse set of human-designed games, especially struggling with world-model learning, memory, and planning.
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.
HybridRAG-Bench reveals that existing benchmarks overestimate the reasoning abilities of retrieval-augmented LLMs due to contamination, offering a more realistic evaluation using up-to-date scientific knowledge.
GPT-5's real-time router learns to route queries to specialized models, making it faster and more useful than its predecessors.
Despite progress in AI safety, how well current safeguards actually prevent AI harms remains largely unmeasured, and what evidence exists shows their effectiveness varies wildly.