6 papers from Allen Institute for AI (AI2) on Eval Frameworks & Benchmarks
LLMs often withhold helpful information because they misinterpret user intent. Multi-turn conversations can unlock that utility, but at the cost of new failure modes such as "utility lock-in" and "unsafe recovery" that single-turn benchmarks miss.
Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.
Synthetic benchmarks can't catch the nuances of personalized deep research, as real users revealed nine critical errors that LLM judges missed entirely.
LLMs still struggle with factual accuracy in specialized medical domains like pancreatic cancer, with hallucination rates varying wildly and web search integration failing to guarantee better responses.
Forget synthetic benchmarks that don't translate: MolmoSpaces offers 230k diverse, simulator-agnostic environments with 130k annotated objects, showing a remarkable 0.96 sim-to-real correlation for robot policies.
RewardBench 2 delivers a reality check for reward models: they struggle significantly on new, human-generated prompts, yet this difficulty is surprisingly predictive of their actual usefulness in downstream tasks.