10 papers from Allen Institute for AI (AI2) on Eval Frameworks & Benchmarks
Synthetic benchmarks can't capture the nuances of personalized deep research: real users revealed nine critical errors that LLM judges missed entirely.
Training on SciMDR, a new 300K QA dataset synthesized from scientific papers, substantially boosts model performance on complex, document-level scientific reasoning tasks.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
Stop generating superficial reviews: RbtAct leverages rebuttals to train LLMs to provide actionable feedback, leading to concrete revisions and improved author uptake.
LLMs still struggle with factual accuracy in specialized medical domains like pancreatic cancer: hallucination rates vary wildly across models, and integrating web search does not guarantee better responses.
VLMs that ace math problems still flunk at understanding *how* students go wrong, highlighting a critical gap for AI in education.
Even the most advanced LLMs fall short in simulating scientific progress, producing synthetic research corpora that lack the diversity and novelty of human-authored work.
Agents that ace long-context recall can still bomb when they need to use that memory to actually *do* something, revealing a critical flaw in how we currently evaluate memory in AI.
Forget synthetic benchmarks that don't transfer: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.
RewardBench 2 delivers a stark reality check for reward models: they struggle significantly on new, human-written prompts, yet this difficulty is surprisingly predictive of their actual usefulness in downstream tasks.