23 papers from Stanford HAI on Eval Frameworks & Benchmarks
AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.
Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.
Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.
Forget chasing the biggest LLM – this benchmark reveals that smaller models (<2B params) can deliver 3x better energy efficiency and faster ROI in real-world industry deployments.
FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.
LLMs are significantly more likely to spread misinformation about countries with lower Human Development Index and in lower-resource languages, revealing a concerning bias in their outputs.
People aren't as bothered by AI failing at easy tasks as you might think, suggesting our expectations for AI competence are more nuanced than a simple aversion to errors.
LLM performance hinges on the code around the model, and Meta-Harness proves that automating the design of this "harness" can significantly boost results across diverse tasks.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
AI agents that ace isolated coding tasks fall apart when faced with the messy reality of continuous software evolution, dropping from 80% to 38% success rates in a new benchmark.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
AI can generate realistic legal questions, but current models still lack diversity and tend to agree too much, revealing critical gaps in their ability to simulate adversarial legal reasoning.
Guaranteeing reductions in harm from biased LLM judges is now possible, even when the biases are unknown or adversarially discovered.
Achieve 50% lower latency in Verilog code generation without sacrificing accuracy by adaptively escalating between LLMs based on diagnostic feedback and formal verification.
Forget expert surveys: GPT-4.1-nano can predict the difficulty of data visualization test questions with surprisingly high accuracy, especially when combining visual and textual cues.
Turns out, the best memory design for robotic manipulation depends heavily on the task, with no single architecture dominating across the board.
Ensembling LLMs for educational tasks can backfire, worsening misalignment with actual learning outcomes despite improved benchmark performance.
LLMs struggle to explore multiple valid reasoning paths, often committing to a single route and missing alternative solutions, especially in complex, multi-step logical problems.
LLMs may grasp the broad strokes of causal strategies, but struggle with the devilish details of research design, as revealed by a new benchmark separating causal identification from estimation.
LLM-generated data can provide statistically valid causal effect estimates in social science, but only if you calibrate the simulations with real human data.
Language model capabilities are surprisingly stable over time for most tasks, except for math reasoning, which continues to advance, offering a way to reliably translate compute budgets into performance expectations.
A fine-tuned open-source Mistral-7B model rivals GPT-4 Turbo in extracting clinical history elements from imaging orders, offering a cost-effective and accurate alternative for assessing clinical history completeness.