Microsoft Research

17 papers from Microsoft Research on Eval Frameworks & Benchmarks

May 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Current LLM jailbreak evaluations often rely on narrow metrics and are therefore inadequate, which motivates a multi-dimensional framework such as Security Cube for comprehensive security assessment.

Feiyue Xu, Hongsheng Hu, Chaoxiang He +9

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

Apr 22, 2026

Apr 22, 2026·also Microsoft Research, KAUST, Northeastern, University of Missouri

LAFA: A Framework for Reproducible Longitudinal Assessment of Protein Function Annotation Models

Continuous benchmarking of protein function prediction models is now possible, enabling faster iteration and more robust performance tracking as annotations evolve.

An Phan, Frimpong Boadu, Maxat Kulmanov +4

Eval Frameworks & Benchmarks, Scientific Discovery & Drug Design

Apr 20, 2026

Microsoft Research·Apr 20, 2026

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Token-level attribution struggles to pinpoint the causes of LLM failures in realistic settings, suggesting current interpretability tools may not be up to the task of debugging complex model behaviors.

Rongyuan Tan, Jue Zhang, Zhuozhao Li +4

Eval Frameworks & Benchmarks, Interpretability & Mechanistic Interp, Natural Language Processing

Apr 19, 2026

Apr 19, 2026·also Microsoft Research, UofT

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Despite impressive unit test pass rates, today's best LLMs rewrite code instead of precisely debugging it, achieving less than 45% edit precision even when explicitly instructed to minimize changes.

Wang Bill Zhu, Miaosen Chai, Shangshang Wang +4

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks

Apr 13, 2026

Apr 13, 2026·also Microsoft Research, UW

Discourse Diversity in Multi-Turn Empathic Dialogue

LLMs are twice as likely as humans to repeat the same support tactic in a conversation, but a simple RL reward for tactic novelty can fix it.

Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez +2

Eval Frameworks & Benchmarks, Natural Language Processing

Apr 9, 2026

Microsoft Research·Apr 9, 2026·also MIT CSAIL

From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants

Gaze-tracking unlocks a new level of personalized AI assistance, enabling LLMs to infer user cognitive states and boost recall performance.

Valdemar Danry, Javier Hernandez, Andrew Wilson +3

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing +1

Microsoft Research·Apr 9, 2026·also Georgia Tech, Virginia Tech

ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

Knowing the *perfect* API to use or *exact* location to edit could drastically improve SWE agent performance, but knowing the perfect regression test result? Not so much.

Kenan Li, Qirui Jin, Liao Zhu +16

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 5, 2026

Stanford HAI·Apr 5, 2026·also Microsoft Research

Effects of Generative AI Errors on User Reliance Across Task Difficulty

People aren't as bothered by AI failing at easy tasks as you might think, suggesting our expectations for AI competence are more nuanced than a simple aversion to errors.

Jacy Reese Anthis, Hannah Cha, Solon Barocas +2

Eval Frameworks & Benchmarks, RLHF & Preference Learning

Apr 2, 2026

Microsoft Research·Apr 2, 2026·also Columbia, UvA

LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

LLMs still fail to grasp research-level mathematics, with top models scoring below random chance when superficial pattern matching is removed, even with access to proof sketches.

Linyang He, Qiyao Yu, Hanze Dong +5

Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought

Mar 17, 2026

Microsoft Research·Mar 17, 2026

Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents

AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.

Shuvendu K. Lahiri

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Mar 17, 2026·also Microsoft Research, USC

Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction

LLMs, even when prompted or fine-tuned, struggle to replicate the messy reality of human conversation, raising serious questions about their utility as proxies for social interaction.

Ryo Kamoi, Ameya Godbole, Longqi Yang +2

Eval Frameworks & Benchmarks, Natural Language Processing

Mar 16, 2026

Microsoft Research·Mar 16, 2026

The Hrunting of AI: Where and How to Improve English Dialectal Fairness

LLMs' ability to fairly represent English dialects hinges on the quality of human consensus, revealing a fundamental challenge in improving performance for low-resource locales.

Adrian de Wynter

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing

Mar 10, 2026

Microsoft Research·Mar 10, 2026·also School of Artificial Intelligence

CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.

Xiangsen Chen, Shuo Chen, Matthieu Maitre +3

Eval Frameworks & Benchmarks, Natural Language Processing, Red-Teaming & Adversarial Robustness

Mar 6, 2026

Microsoft Research·Mar 6, 2026

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

LLMs writing long stories frequently contradict themselves on basic facts and timelines, especially in the middle of the narrative, highlighting a critical weakness in long-form generation.

Junjie Li, Xinru Guo, Xinrui Guo +7

Eval Frameworks & Benchmarks, Natural Language Processing

Mar 1, 2026

Mar 1, 2026·also Microsoft Research, Rutgers, Shanghai Key Laboratory of Multimodal

Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data

LLMs can mimic your style, but your friends can still tell it's not really you, especially when it comes to your opinions.

Ziyi Ye, Xi Zhu, Dimitris N. Metaxas

Eval Frameworks & Benchmarks, Natural Language Processing

Feb 17, 2026

Microsoft Research·Feb 17, 2026·also SambaNova

The Limits of Long-Context Reasoning in Automated Bug Fixing

LLMs can't reliably debug code in long contexts (64k-128k tokens) even with perfect information retrieval, despite impressive performance in agentic workflows that decompose the task.

Ravi Raju, Mengmeng Ji, Shubhangi Upasani +1

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought

Feb 16, 2026

Computer Science and Engineering·Feb 16, 2026·also Microsoft Research, Notre Dame

Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

LLM development teams often resort to workarounds and augmentation strategies when faced with the practical challenges of integrating domain experts, revealing a gap between ideal participatory design and real-world constraints.

Annalisa Szymanski, Oghenemaro Anuyah, Toby Jia-Jun Li

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing
