13 papers from Amazon Science on Eval Frameworks & Benchmarks
LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.
Forget expensive multilingual annotations: this framework lets you evaluate LLMs in new languages by transferring knowledge from English, with surprisingly strong results.
Current machine unlearning methods for recommender systems struggle with robustness and sequential deletions, especially in attention-based and recurrent models, a critical gap that ERASE exposes.
Save 20% on LLM costs with under 2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM (see the routing sketch after this listing).
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Despite matching or exceeding human expert performance on generating potential diagnoses, current MLLMs struggle to synthesize multimodal clinical evidence for final diagnosis, revealing a critical gap in their clinical reasoning abilities.
Forget Bonferroni: a new sequential testing approach slashes audit times for multi-stream ML systems, especially when anomalies are widespread.
Latent reasoning models often take shortcuts to achieve high accuracy, and stronger supervision, while mitigating this, paradoxically restricts the diversity of their latent representations.
Static benchmarks can be fooled by fluent text and aligned citations, but DREAM uses agentic evaluation to expose the capability mismatch in assessing the temporal validity and factual correctness of research agents.
A new benchmark, OPBench, offers researchers a standardized way to evaluate and improve graph learning models tackling the opioid crisis across diverse real-world scenarios.
Object hallucination in MLLMs can be significantly reduced by simply masking salient visual features during contrastive decoding (see the decoding sketch after this listing).
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates by 25.6% on average, even when the content is equivalent.
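To make the cost-saving cascade entry above concrete, here is a minimal sketch of confidence-gated routing between a small and a large model; the `generate_with_confidence` interface and the threshold value are illustrative assumptions, not the paper's implementation.

```python
def cascade_answer(prompt, slm, llm, confidence_threshold=0.9):
    """Try the small language model (SLM) first; escalate to the large
    model only when the SLM's calibrated confidence falls below the
    threshold. `generate_with_confidence` is an assumed interface that
    returns (answer, calibrated_confidence)."""
    answer, confidence = slm.generate_with_confidence(prompt)
    if confidence >= confidence_threshold:
        return answer            # cheap path: accept the SLM answer
    return llm.generate(prompt)  # expensive path: defer to the LLM
```

The cost saving depends on how often the calibrated confidence clears the threshold, so calibration quality directly sets the cost-accuracy trade-off.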
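And as a rough illustration of the contrastive-decoding entry above, the sketch below contrasts next-token logits computed from the full visual features against logits computed with the salient features masked out; the model interface, the 0/1 mask, and `alpha` are assumptions for illustration, not the paper's exact recipe.

```python
def contrastive_logits(model, text_ids, image_feats, salient_mask, alpha=1.0):
    """Contrastive-decoding sketch: compare logits conditioned on the full
    image features against logits conditioned on a copy with the salient
    features zeroed out, then amplify the difference so tokens the model
    would emit even without the salient visual evidence are suppressed.
    Assumed interface: model(text_ids, image_feats) -> next-token logits;
    salient_mask is a 0/1 array marking the salient visual features."""
    logits_full = model(text_ids, image_feats)
    logits_masked = model(text_ids, image_feats * (1 - salient_mask))
    return (1 + alpha) * logits_full - alpha * logits_masked
```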