26 papers from Stanford HAI on Eval Frameworks & Benchmarks
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but the discrepancies reveal more about clinical workflow gaps than about AI errors.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
LLM agents struggle to maintain coherent decision-making in realistic retail environments over long horizons, even with a novel framework for adaptive strategy evolution.
Most AI failures aren't the spectacular kind, but silent breakdowns in interaction that will persist even as models get smarter.
AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, with success rates dropping from 80% to 38% on a new benchmark.
Despite excelling at speech recognition, current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance.
AI can generate realistic legal questions, but current models still suffer from limited diversity and a tendency to agree too readily, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Guaranteeing reductions in harm from biased LLM judges is now possible, even when the biases are unknown or adversarially discovered.
Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Turns out, the best memory design for robotic manipulation depends heavily on the task, with no single architecture dominating across the board.
Forget OCR? Powerful MLLMs can extract information from business documents just as well from images alone as from OCR-extracted text, challenging the necessity of traditional OCR pipelines.
Model handoffs in multi-turn LLM systems can swing performance by up to 13 percentage points, revealing a hidden reliability risk that single-model benchmarks miss.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
An interactive AI can fairly evaluate skills across diverse self-presentation styles, ensuring equitable outcomes even when individuals differ in their tendency towards self-promotion or modesty.
LLMs struggle to explore multiple valid reasoning paths, often committing to a single route and missing alternative solutions, especially in complex, multi-step logical problems.
LLMs may grasp the broad strokes of causal strategies, but struggle with the devilish details of research design, as revealed by a new benchmark separating causal identification from estimation.
Airavat automates expert-level Internet measurement, catching methodological flaws that traditional testing misses.
LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.
Language model capabilities are surprisingly stable over time for most tasks, with math reasoning the one area that keeps advancing, which makes it possible to reliably translate compute budgets into performance expectations.
A clinical reasoning system using curated evidence beats GPT-5 on endocrinology board exams, suggesting that domain-specific knowledge can outweigh raw LLM scale in specialized fields.
Despite progress in AI safety, it's still largely unknown how well current safeguards prevent AI harms, and their measured effectiveness varies wildly.
A fine-tuned open-source Mistral-7B model rivals GPT-4 Turbo in extracting clinical history elements from imaging orders, offering a cost-effective and accurate alternative for assessing clinical history completeness.