Amazon Science

18 papers from Amazon Science on Eval Frameworks & Benchmarks

Apr 20, 2026

Amazon Science·Apr 20, 2026

FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction

Targeted neuro-symbolic integration can reduce content bias in syllogistic reasoning, achieving over 94% accuracy while cutting content effects by 16%.

Adewale Akinfaderin, Nafi Diallo

Eval Frameworks & Benchmarks, Open-Source Models & Weights, Reasoning & Chain-of-Thought

Apr 13, 2026

Amazon Science·Apr 13, 2026

Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

RAG systems are stuck in a factual echo chamber, ignoring the rich tapestry of opinions that shape real-world understanding.

Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya +2

Eval Frameworks & Benchmarks, Natural Language Processing, Recommendation & Information Retrieval

Apr 13, 2026·also Amazon Science, UC Davis

ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation

LLMs can now autonomously translate entire C projects to Rust with near-perfect accuracy, thanks to a novel agentic framework that dynamically navigates dependencies and iteratively verifies translations.

Muhammad Farrukh, Baris Coskun, Tapti Palit +1

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 9, 2026

Apr 9, 2026·also Amazon Science

Awakening the Sleeping Agent: Lean-Specific Agentic Data Reactivates General Tool Use in Goedel Prover

Domain-specific fine-tuning can induce "agentic collapse" in LLMs, but a surprisingly small amount of agentic data from *another* domain can bring those general tool-use skills roaring back.

Jui-Hui Chung, Hongzhou Lin +6

Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought, Tool Use & Agents

Apr 8, 2026

Apr 8, 2026·also Amazon Science

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

Forget wrestling with language-specific tooling: ReCodeAgent autonomously translates and validates entire code repositories across diverse languages with a 60% boost in test pass rates.

Ali Reza Ibrahimzada, Brandon Paulsen, Daniel Kroening +1

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Apr 6, 2026

Amazon Science·Apr 6, 2026

Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs

LLMs aren't culture-aware reasoners, but biased translators: they generate stereotyped metaphors and default to Western perspectives even when prompted with specific cultural identities.

Yuan Chang, Jiaming Qu, Zhu Li

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing

Apr 2, 2026

Amazon Science·Apr 2, 2026·also JHU, University of Waikato

RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

LLMs can automatically generate web vulnerability detection rules with surprisingly high accuracy, but only with careful validation and human oversight to mitigate overconfidence.

Ayush Garg, Sophia Hager, Jacob Montiel +5

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

Mar 19, 2026

Amazon Science·Mar 19, 2026

RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.

Weronika Łajewska, Paul Missault +2

Eval Frameworks & Benchmarks, Natural Language Processing, Recommendation & Information Retrieval

Amazon Science·Mar 19, 2026

Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Forget expensive multilingual annotations: this framework lets you evaluate LLMs in new languages by transferring knowledge from English, with surprisingly strong results.

Ivaxi Sheth, Zeno Jonke +5

Eval Frameworks & Benchmarks, Natural Language Processing

Mar 4, 2026

Amazon Science·Mar 4, 2026

Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning

Save 20% on LLM costs with <2% accuracy drop by strategically cascading a small model with a large one, guided by a confidence-calibrated SLM.

Chuang Zhang, Zizhen Zhu, Yihao Wei +4

Eval Frameworks & Benchmarks, Inference & Quantization, Reasoning & Chain-of-Thought

Mar 3, 2026

Amazon Science·Mar 3, 2026·also Meta AI, Stanford HAI

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

LLMs can ace math problems while reasoning like a drunk toddler, with 82% of correct answers arising from unstable, inconsistent logic.

Aman Chadha, Vinija Jain

Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought

Mar 1, 2026

Amazon Science·Mar 1, 2026·also Meta AI, Stanford HAI

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.

Vinija Jain, Aman Chadha

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness

Feb 25, 2026

Amazon Science·Feb 25, 2026·also MBZUAI, Oxford, PKU, University of Bern

MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Despite matching or exceeding human expert performance on generating potential diagnoses, current MLLMs struggle to synthesize multimodal clinical evidence for final diagnosis, revealing a critical gap in their clinical reasoning abilities.

Xudong Liu, Jiachuan Peng +9

Eval Frameworks & Benchmarks, Multimodal Models, Scientific Discovery & Drug Design

Feb 25, 2026·also Amazon Science

Global Sequential Testing for Multi-Stream Auditing

Forget Bonferroni: a new sequential testing approach slashes audit times for multi-stream ML systems, especially when anomalies are widespread.

Beepul Bharti, Ambar Pal, Jeremias Sulam

Eval Frameworks & Benchmarks

Amazon Science·Feb 25, 2026·also Michigan State

How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Latent reasoning models often take shortcuts to achieve high accuracy, and stronger supervision, while mitigating this, paradoxically restricts the diversity of their latent representations.

Yingqian Cui, Zhenwei Dai, Bing He +5

Eval Frameworks & Benchmarks, Natural Language Processing, Reasoning & Chain-of-Thought

Feb 23, 2026

Amazon Science·Feb 23, 2026·also Anthropic

SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

Forget fine-tuning: inject targeted time-series insights into general LLMs and watch their reasoning skills skyrocket by up to 26%.

Zelin He, Boran Han +17

Eval Frameworks & Benchmarks, Natural Language Processing, Reasoning & Chain-of-Thought

Feb 12, 2026

Amazon Science·Feb 12, 2026

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Object hallucination in MLLMs can be significantly reduced by simply masking salient visual features during contrastive decoding.

Xudong Liu

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Aug 6, 2025

UW·Aug 6, 2025·also Amazon Science, BAIR, Stanford HAI

I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates by 25.6% on average, even when the content is equivalent.

Julia Kharchenko, Tanya Roosta, Aman Chadha +1

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing
