18 papers from Amazon Science on Eval Frameworks & Benchmarks
Targeted neuro-symbolic integration can reduce content bias in syllogistic reasoning, achieving over 94% accuracy while cutting content effects by 16%.
RAG systems are stuck in a factual echo chamber, ignoring the rich tapestry of opinions that shape real-world understanding.
LLMs can now autonomously translate entire C projects to Rust with near-perfect accuracy, thanks to a novel agentic framework that dynamically navigates dependencies and iteratively verifies translations.
Domain-specific fine-tuning can induce "agentic collapse" in LLMs, but a surprisingly small amount of agentic data from *another* domain can bring those general tool-use skills roaring back.
Forget wrestling with language-specific tooling: ReCodeAgent autonomously translates and validates entire code repositories across diverse languages with a 60% boost in test pass rates.
LLMs aren't culture-aware reasoners, but biased translators: they generate stereotyped metaphors and default to Western perspectives even when prompted with specific cultural identities.
LLMs can automatically generate web vulnerability detection rules with surprisingly high accuracy, but only with careful validation and human oversight to mitigate overconfidence.
LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.
Forget expensive multilingual annotations: this framework lets you evaluate LLMs in new languages by transferring knowledge from English, with surprisingly strong results.
Save 20% on LLM costs with <2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM.
LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Despite matching or exceeding human expert performance on generating potential diagnoses, current MLLMs struggle to synthesize multimodal clinical evidence for final diagnosis, revealing a critical gap in their clinical reasoning abilities.
Forget Bonferroni: a new sequential testing approach slashes audit times for multi-stream ML systems, especially when anomalies are widespread.
Latent reasoning models often take shortcuts to achieve high accuracy, and stronger supervision, while mitigating this, paradoxically restricts the diversity of their latent representations.
Forget fine-tuning: inject targeted time-series insights into general LLMs and watch their reasoning skills skyrocket by up to 26%.
Object hallucination in MLLMs can be significantly reduced by simply masking salient visual features during contrastive decoding.
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates by 25.6% on average, even when the content is equivalent.