April 27 – May 4, 2026

Eval Frameworks & Benchmarks - Weekly Roundup

100 papers published across 6 labs.

3600% acceleration

Selected Labs publishing this week

Tsinghua AI (4), Stanford HAI (2), ETH (1), Google Research (1), Mila (1)

Top Papers

Apr 30, 2026

LS2N - Nantes University·3w ago·also LIA - Avignon University, LIUM - Le Mans University

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.

Thibault Bañeras-Roux, Mickaël Rouvier +210

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

LS2N - Nantes University·3w ago·also LIA - Avignon University, LIUM - Le Mans University

HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.

Thibault Bañeras-Roux, Jane Wottawa +4

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Stanford HAI·3w ago

Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading

Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.

Nicholas Sadjoli, Tim Siefken, Atin Ghosh +2

Eval Frameworks & Benchmarks Natural Language Processing

May 4, 2026

3w ago

AcademiClaw: When Students Set Challenges for AI Agents

Today's best AI agents can only solve 55% of real-world academic tasks that university students find challenging, revealing a significant gap between current AI capabilities and the demands of academic workflows.

Junjie Yu, Pengrui Lu, Weiye Si +75

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Stanford HAI·3w ago

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler +10

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

All Papers (100)

May 4, 2026

3w ago

AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu, Pengrui Lu, Weiye Si +75

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Stanford HAI·3w ago

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler +10

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Georg-August-Universität Göttingen·3w ago

A Treasure Trove of Performance: Analyzing the IO500 Submission Data

HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.

Julian Kunkel, Aasish Kumar Sharma, Anila Ghazanfar +2

Distributed Systems & Hardware Eval Frameworks & Benchmarks

Posts and Telecommunications Institute of Technology·3w ago

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.

Tung Vu, Yen Nguyen, Hai Nguyen +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Speech & Audio

ETH·3w ago·also UZH

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Pehuén Moure, Niclas Pokel, Bilal Bounajma +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

3w ago

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.

Ruofeng Yang, Yongcan Li, Shuai Li

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

May 3, 2026

3w ago

On the Distortion of Partitioning Performance by Random Quantum Circuits

Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.

Maria Gragera Garces

Distributed Systems & Hardware Eval Frameworks & Benchmarks

3w ago

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

MLLMs hallucinate less when you nudge them to pay more attention to non-text inputs during inference, without any training.

Itai Allouche, Joseph Keshet

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Huan Zhang +9·3w ago

RenCon 2025: Revival of the Expressive Performance Rendering Competition

Expressive piano performance rendering is improving, but RenCon 2025 reveals we're still far from replicating human musicality.

Huan Zhang, Taegyun Kwon, Anders Friberg +7

Eval Frameworks & Benchmarks Speech & Audio

Xiaoda Yang +12·3w ago

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

Current audio-visual models nail unimodal quality but still struggle to make music and dance move together rhythmically, highlighting a key gap TMD-Bench is designed to address.

Xiaoda Yang, Majun Zhang, Changhao Pan +10

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Tianxiang Dai +1·3w ago

Counting as a minimal probe of language model reliability

LLMs can't reliably count beyond a small number of steps, revealing a surprising brittleness in their ability to execute seemingly simple procedures despite fluent performance on complex tasks.

Tianxiang Dai, Jonathan A. Fan

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

May 2, 2026

Daoxuan Zhang +3·3w ago

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

Current MLLM-driven UAV agents still struggle with spatial memory and aerial adaptation when tasked with autonomously exploring and reasoning about victim locations in realistic search and rescue scenarios.

Daoxuan Zhang, Ping Chen, Jianyi Zhou +1

Eval Frameworks & Benchmarks Robotics & Embodied AI Tool Use & Agents

Google Research·3w ago·also TAU

Hallucinations Undermine Trust; Metacognition is a Way Forward

LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.

G. Yona, Mor Geva, Yossi Matias

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

May 1, 2026

Indraneil Paul +3·3w ago

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.

Indraneil Paul, Goran Glavaš +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning

Minbyul Jeong·3w ago

Healthcare AI GYM for Medical Agents

Multi-turn medical AI agents trained with RL tend to collapse into verbose, single-turn monologues, but a novel self-distillation method can restore multi-turn tool use and improve performance.

Minbyul Jeong

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design Tool Use & Agents

Apr 30, 2026

Sofía Pérez Casulo +7·3w ago

A Unified Framework of Hyperbolic Graph Representation Learning Methods

Hyperbolic embeddings are powerful, but a fragmented ecosystem makes them hard to use—this framework finally puts them all in one place.

Sofía Pérez Casulo, Marcelo Fiori +5

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Open-Source Models & Weights

Hanane Nour Moussa +10·3w ago·also Cisco AI Research

D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

Training on D3-Gym, a new dataset of real-world scientific tasks with verifiable environments, closes the gap between open-source and proprietary models on ScienceAgentBench by 7.8 points.

Hanane Nour Moussa, Yifei Li, Zhuoyang Li +8

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Scientific Discovery & Drug Design

3w ago

Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior

See how LLMs' stances on vaccines, disinformation, and gender equality shift when they "become" different people, thanks to a new dataset of 190,000 persona-driven debates.

Ali Aghazadeh Ardebili, Massimo Stella +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Friedrich Schiller University·3w ago

Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis

Transformers struggle to extrapolate to syntactically novel programs in program synthesis, even with significant compute scaling, suggesting current approaches are bottlenecked by a lack of training diversity.

Henrik Voigt, Michael Habeck, Joachim Giesen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Saeid Asgari Taghanaki +15·3w ago

Diagnosing Capability Gaps in Fine-Tuning Data

Stop wasting compute on fine-tuning datasets with hidden capability gaps: GoalCover lets you diagnose and fix them *before* training.

Saeid Asgari Taghanaki, Rakshanda Agarwal, Raksha Agarwal +13

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Anietta Weckauff +4·3w ago

Characterizing the Consistency of the Emergent Misalignment Persona

Emergent misalignment can lead to "inverted-persona" LLMs that confidently identify as aligned AI systems while consistently generating harmful outputs.

Anietta Weckauff, Yuchen Zhang +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

DP Technology·3w ago

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Current multimodal LLMs struggle to understand scientific spectra, but a new benchmark and data processing technique could change that.

Jialu Shen, Jialun Shen, Han Lyu +6

Eval Frameworks & Benchmarks Multimodal Models Scientific Discovery & Drug Design

Dawid Wisniewski +1·3w ago

Beyond Semantics: Measuring Fine-Grained Emotion Preservation in Small Language Model-Based Machine Translation

Even with emotion-aware prompting, today's best small language models still struggle to preserve subtle emotional nuances when translating between languages.

Dawid Wisniewski, Igor Czudy

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

General Reasoning·3w ago

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Even the most advanced language models still lose money and demonstrate unsophisticated strategies when tasked with maximizing long-term bankroll growth in a realistic sports betting simulation, highlighting a significant gap in their sequential decision-making capabilities.

Thomas J. Grady, Kip Parker +4

Eval Frameworks & Benchmarks Tool Use & Agents World Models & Planning

Haonan Li +3·3w ago

MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

Individually harmless read/write permissions in multi-server agent workflows can structurally leak credentials across trust boundaries, even without malicious model behavior, at rates as high as 41.3%.

Haonan Li, Tianjun Sun, Yongqing Wang +1

Eval Frameworks & Benchmarks Tool Use & Agents

3w ago·also IU Bloomington, NTU

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

Google's AI Overviews favor Google-owned content and penalize sites blocking its AI crawler, raising serious questions about fairness and bias in the emerging generative search landscape.

Riley Grossman, Songjia Liu, Songjiang Liu +6

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Barcelona Supercomputing Center·3w ago

RuC: HDL-Agnostic Rule Completion Benchmark Generation

LLMs struggle to complete RTL code, and their performance hinges on the grammatical structure of the missing code and the prompting strategy used.

Arnau Ayguadé Domingo, Miquel Alberti-Binimelis +7

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Tsinghua AI·3w ago

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Even the most advanced vision-language models struggle to accurately identify anatomical structures in medical images, raising serious concerns about their reliability in clinical settings.

Xupeng Chen, Binbin Shi, Chenqian Le +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Stanford HAI·3w ago

Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading

Model rankings on standard benchmarks can flip entirely when you optimize prompts for each LLM, so your "best" model might actually be the worst.

Nicholas Sadjoli, Tim Siefken, Atin Ghosh +2

Eval Frameworks & Benchmarks Natural Language Processing

3w ago

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.

Petter Törnberg, Michelle Schimmel +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Garvin Kruthof·3w ago

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

LLMs can accurately recall constraints while simultaneously violating them, with "knows-but-violates" rates ranging from 8% to 99%, revealing a fundamental flaw in multi-turn ideation.

Garvin Kruthof

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Lauren Cadwallader +7·3w ago

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

LLMs reveal that research data is being reused far more often than previously thought, suggesting open science's impact is bigger than we realized.

Lauren Cadwallader, Iain Hrynaszkiewicz +5

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Pengyun Zhu +9·3w ago

APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

Forget training LLMs to understand privacy policies – a specialized, expert-annotated dataset and hybrid framework can do it better, achieving superior readability and reliability.

Pengyun Zhu, Qiheng Sun, Long Wen +7

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

LS2N - Nantes University·3w ago·also LIA - Avignon University, LIUM - Le Mans University

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

WER hides the real story: new metrics reveal how language model rescoring in ASR impacts grammatical correctness and semantic accuracy.

Thibault Bañeras-Roux, Mickaël Rouvier +210

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Rebecca Soskin Hicks +19·3w ago

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

ChatGPT for Clinicians, not human doctors, currently achieves the highest scores on a new benchmark of real-world clinical LLM tasks.

Rebecca Soskin Hicks, Mikhail Trofimov +17

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

3w ago·also AI Laboratory, Princeton, ZJU

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Even GPT-5.1 struggles to distinguish AI-generated academic images from real ones, achieving only 48.8% accuracy, revealing a significant gap between generative and forensic AI capabilities.

Bo Zhang, Bo Zhang, Tzu-Yen Ma +33

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

3w ago

Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots

Your AI chatbot conversations aren't as private as you think: most leak conversation content and user identity to third-party trackers.

Muhammad Jazlan, Ethan Wang, Yash Vekaria +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Matthew Christian Agustin·3w ago

Evaluating Epistemic Guardrails in AI Reading Assistants: A Behavioral Audit of a Minimal Prototype

LLM reading assistants don't need to hallucinate to be harmful; they can subtly steal the user's interpretive labor, even when designed with "epistemic guardrails."

Matthew Christian Agustin

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Beijing University of Posts and Telecommunications (BUPT)·3w ago

SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

Instruction tuning on a new dataset, SecGoal, allows smaller 7B/9B parameter models to outperform much larger LLMs in extracting and formalizing security goals from protocol documents.

Dawei Huang, Hui Li, Haonan Feng +4

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

3w ago·also UMass

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

LLMs still can't reliably reverse engineer stripped binaries, and REBench offers a standardized, fair-by-construction benchmark to finally measure progress.

Junsuh Won, Jun Yeon Won, Xin Jin +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Tsinghua AI·3w ago·also BUPT

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Today's best vision-language models are surprisingly bad at reading scientific figures, failing to match expert-level reasoning on a new benchmark of experimental images.

Junpeng Ding, Zichen Tang +21

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Doyeop Kwak +3·3w ago

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Visual cues become crucial for speech recognition when audio quality tanks in this challenging new benchmark derived from real-world conversations.

Doyeop Kwak, Jeongsoo Choi, Suyeon Lee +1

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

3w ago

SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

The standard "human-likeness" test for user simulators is essentially useless for predicting whether they produce valid system rankings.

Saber Zerhoudi

Eval Frameworks & Benchmarks Recommendation & Information Retrieval

3w ago·also Interdisciplinary Transformation

NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

Stop retrieving passages in your RAG system: NuggetIndex shows that retrieving and filtering atomic "nuggets" of information yields substantial gains in recall, temporal correctness, and reduced conflicts.

Saber Zerhoudi, Michael Granitzer, Jelena Mitrović +1

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Davide Di Nucci +4·3w ago·also University of Modena and Reggio Emilia

Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering

Current image forensics fall flat when faced with the subtle manipulations now possible in 3D Gaussian Splatting scenes, highlighting a critical gap in content authenticity assessment.

Davide Di Nucci, Riccardo Catalini, Guido Borghi +2

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

3w ago

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Even the best vision-language models struggle to reliably set fine-grained GUI states, achieving only 33% accuracy on a new benchmark, but targeted visual hints suggest a clear path to improvement.

Fengxian Ji, Jingpu Yang, Zirui Song +5

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Qiyao Wang +7·3w ago

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.

Qiyao Wang, Haoran Hu, Longze Chen +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Multimodal Models +1

Mohamed bin Zayed University of Artificial Intelligence·3w ago

Instruction-Guided Poetry Generation in Arabic and Its Dialects

Forget Shakespeare, LLMs can now sling verses in Arabic dialects, thanks to a new dataset for instruction-guided poetry generation.

Abdelrahman Sadallah, Kareem Elozeiri +7

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

3w ago·also HKU, HKUST, PKU, SCUT +1

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.

Chenxin Li, Chenxing Li, Zhengyang Tang +9

Eval Frameworks & Benchmarks Tool Use & Agents

3w ago

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Today's best GUI agents choke on real-world, multi-application workflows, achieving less than 21% success rate, revealing a critical gap in their ability to coordinate across applications and perform conditional reasoning.

Jinchao Li, Yunxin Li, Chenrui Zhao +4

Eval Frameworks & Benchmarks Tool Use & Agents

An-Yang Ji +6·3w ago

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data is required.

An-Yang Ji, Anya Ji, Jun-Peng Jiang +4

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Simon Dennis +5·3w ago·also Melbourne

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.

Simon Dennis, Michael Diamond, Rivaan Patil +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

Geometry-Calibrated Conformal Abstention for Language Models

LMs can now selectively abstain from answering with provable guarantees, thanks to a new method that uses representation geometry to better gauge when they're out of their depth.

Yi Chen, Sihong Xie, Hui Xiong

Eval Frameworks & Benchmarks Natural Language Processing

3w ago

Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.

Naomi Esposito, Anthony Tricarico +5

Eval Frameworks & Benchmarks Open-Source Models & Weights Reasoning & Chain-of-Thought

Kenneth J. K. Ong·3w ago

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

VLMs playing the Prisoner's Dilemma can be manipulated into selfish behavior simply by showing them images of aggression or reward matrices with specific color schemes.

Kenneth J. K. Ong

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Multimodal Models

Taslim Jamal Arif +2·3w ago

Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

Real-world Text-to-SQL systems can now be continuously evaluated and improved in production, even without access to database schemas or ground-truth queries.

Taslim Jamal Arif, Kuldeep Singh

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Ivan Bercovich +1·3w ago

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.

Ivan Bercovich

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Sihong Wu +8·3w ago·also Yale

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

LLMs are rapidly transforming peer review, but critical gaps remain in ensuring quality, fairness, and ethical considerations across the entire workflow.

Sihong Wu, Owen Jiang, Yilun Zhao +6

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Jackson Vonderhorst +5·3w ago·also Notre Dame

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.

Jackson Vonderhorst, Kuangshi Ai, Haichao Miao +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Ce Chen +8·3w ago·also HeyGen Research

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Injecting optical flow into VLMs lets them spot subtle video transitions that other methods miss, opening the door to more robust video understanding.

Ce Chen, Yi Ren, Yuanming Li +6

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Matteo Da Pelo +6·3w ago·also University of Cagliari, University of Salerno

Taming the Centaur(s) with LAPITHS: a framework for a theoretically grounded interpretation of AI performances

Claims of human-like cognition in models like CENTAUR crumble under LAPITHS, a framework that reveals these models' performance can be replicated by systems lacking cognitive plausibility.

Matteo Da Pelo, Alessio Donvito, Claudio Frongia +4

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

3w ago

NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Retrieval improvements don't always boost reasoning in RAG systems, but NeocorRAG's evidence chains can fix that, achieving SOTA with 20% fewer tokens.

Shiyao Peng, Qianhe Zheng, Zhuodi Hao +8

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Zhuoran Pan +4·3w ago

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.

Zhuoran Pan, Yue Li, Zhi Guan +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Mohd Sameen Chishti +2·3w ago

Test Before You Deploy: Governing Updates in the LLM Supply Chain

Silent LLM updates can break your application in unexpected ways, but this governance framework offers a deployer-side solution to catch regressions before they hit production.

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

3w ago

HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

LLMs can now reliably generate IC verification testbenches, not by writing HDL directly, but by orchestrating a novel hybrid approach that combines LLM-driven planning with template-based HDL generation.

Chang-Chih Meng, Yu-Ren Lu +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Qingyu Ren +3·3w ago·also Fudan

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks

Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.

Qingyu Ren, Tian Pan, Tianjun Pan +1

Eval Frameworks & Benchmarks Natural Language Processing RLHF & Preference Learning

Tsinghua AI·3w ago·also Hainan University

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.

Yu Tian, Jiawei Chen, Lifang Zheng +7

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Yuxi Ma +7·3w ago

Multi-Level Narrative Evaluation Outperforms Lexical Features for Mental Health

LLMs beat word counts for predicting mental health from therapeutic writing, proving that *how* you tell a story matters more than *what* words you use.

Yuxi Ma, Jieming Cui, Muyang Li +5

Eval Frameworks & Benchmarks Natural Language Processing

Neemias B da Silva +3·3w ago

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Persona prompting LLMs for urban sentiment analysis yields surprisingly little behavioral diversity, with a no-persona model often performing just as well.

Neemias B da Silva, Rodrigo Minetto, Daniel Silver +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

E. Beck +10·3w ago

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

General American English ASR performance doesn't guarantee similar accuracy across other English accents, as revealed by a new multi-accent call center dataset.

Eugen Beck, S. Beranek +8

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing +1

LS2N - Nantes University·3w ago·also LIA - Avignon University, LIUM - Le Mans University

HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Current ASR metrics, even those leveraging embeddings, fail to align with human perception of transcription quality, as revealed by a new human-annotated dataset.

Thibault Bañeras-Roux, Jane Wottawa +4

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

3w ago·also BAIR, Mila, Toronto Metropolitan University, UofT

A Reproducibility Study of LLM-Based Query Reformulation

LLM-powered query reformulation, a hot topic in IR, often fails to translate gains from lexical to neural retrieval, and bigger models don't always help.

Amin Bigdeli, Radin Hamidi Rad, Hai Son Le +4

Eval Frameworks & Benchmarks Open-Source Models & Weights Recommendation & Information Retrieval

3w ago

RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.

Jiachen Liu, Zichen Tang +10

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

3w ago

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.

Jiasheng Zheng, Xin Zheng, Boxi Cao +9

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Sidi Chang +4·3w ago·also Blossom AI Labs

Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Subtle wording changes in benchmark rubrics can swing model performance by over 13%, revealing a hidden subjectivity in "objective" gold labels.

Sidi Chang, Pei-ke Zhu, Peiying Zhu +2

Eval Frameworks & Benchmarks Natural Language Processing

Jon-Paul Cacioli·3w ago

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

LLM upgrades are a chaotic mix of progress and decay: despite overall gains, up to 47% of questions get *worse* after an update, and single-shot evals miss almost half of these critical regressions.

Jon-Paul Cacioli

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Kansai University·3w ago·also RIKEN, Shiga University

LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human–LLM Judgment Gaps

LLMs reliably capture emotions with explicit lexical markers, but systematically fail on pragmatically complex emotions requiring contextual inference, revealing a critical limitation in their ability to understand nuanced human emotion.

Keito Inoshita, Xiaokang Zhou, Akira Kawai +2

Eval Frameworks & Benchmarks Natural Language Processing

Trent University·3w ago

The Likelihood Ratio Wall: Structural Limits on Accurate Risk Assessment for Rare Violence

Expect pretrial risk assessment tools to be wrong more often than right when flagging someone as "high risk" for rare violent re-offense, regardless of recalibration efforts.

Marco Pollanen

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks
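
The base-rate arithmetic behind a claim like this can be sketched with Bayes' rule. The snippet below is a minimal illustration, not the paper's method: the function names and the example numbers (a 2% base rate, 70% sensitivity, 80% specificity) are assumptions chosen only to show why a "high risk" flag for a rare outcome is usually a false positive.

```python
def positive_predictive_value(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Probability that a person flagged 'high risk' actually re-offends (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = P(flag | re-offends) / P(flag | does not re-offend)."""
    return sensitivity / (1.0 - specificity)


def min_positive_likelihood_ratio(base_rate: float) -> float:
    """A flag is right more often than wrong (PPV > 0.5) only if LR+ exceeds (1 - p) / p."""
    return (1.0 - base_rate) / base_rate


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; they are not taken from the paper.
    base_rate, sensitivity, specificity = 0.02, 0.70, 0.80
    ppv = positive_predictive_value(base_rate, sensitivity, specificity)
    print(f"PPV: {ppv:.3f}")
    print(f"LR+ of the tool: {positive_likelihood_ratio(sensitivity, specificity):.1f}")
    print(f"LR+ needed for PPV > 0.5: {min_positive_likelihood_ratio(base_rate):.1f}")
```

Under these assumed numbers, roughly 7% of flagged people would go on to re-offend. The tool's positive likelihood ratio is about 3.5, while a ratio above about 49 would be needed for flags to be right more often than wrong. Recalibration shifts the threshold but does not raise the likelihood ratio the instrument can achieve.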

China Telecom Research Institute·3w ago

How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection

LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.

Maofei Chen, Laifu Wang, Yue Qin +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Open-Source Models & Weights

Md. Faizul Ibne Amin +5·3w ago

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.

Md. Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning +1

Shiqi Xu +5·3w ago

ClimateVID – Social Media Videos Analysis and Challenges Involved

Despite the promise of VLMs, current models still struggle to grasp the nuances of climate change discourse in social media videos, highlighting the need for more specialized approaches.

Shiqi Xu, Moritz Burmester, Katharina Prasse +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

M. Riera-Marín +40·3w ago·also Basque Research and Technology Alliance, DKFZ, Hospital de Mataró, i Estudis Avançats (ICREA) +11

Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark

Seemingly strong segmentation models can fail at clinically critical tumor-vessel interfaces, highlighting the need for uncertainty-aware AI in pancreatic cancer staging.

M. Riera-Marín, O. K. Sikha +38

Computer Vision Eval Frameworks & Benchmarks Scientific Discovery & Drug Design

3w ago·also Shanghai AI Lab

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Current MLLMs still struggle to connect the dots between images and text when they're interleaved, highlighting a critical gap in real-world multimodal understanding.

Bingli Wang, Huanze Tang, Haijun Lv +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Apr 29, 2026

3w ago

LUCid: Redefining Relevance For Lifelong Personalization

Even state-of-the-art models like Gemini and Claude can completely miss critical user information when it's buried in semantically unrelated past interactions, tanking personalization performance.

Chimaobi Okite, Anika Misra, Joyce Chai +1

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Gilberto Sussumu Hida +2·3w ago

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

LLMs don't automatically win at study screening for software engineering SLRs: their performance is highly variable, sensitive to input data, and not consistently better than classical models.

Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, Erika Yahata

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

University of Hildesheim·3w ago

Reproducible Automated Program Repair Is Hard – Experiences With the Defects4J Dataset

Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.

Adam Krafczyk, Klaus Schmid

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Open-Source Models & Weights

Verint Systems Inc·3w ago

When Your LLM Reaches End-of-Life: A Framework for Confident Model Migration in Production Systems

Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.

Emma Casey, David Roberts, David Sim +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks Inference & Quantization

Kyushu Institute of Technology Iizuka·3w ago

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

LLMs fail to refuse harmful instructions over half the time in a simulated robotic health attendant setting, even when fine-tuned on medical data.

Mahiro Nakao, Kazuhiro Takemoto

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Robotics & Embodied AI

3w ago·also North South University, QMUL

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Barcelona Supercomputing Center (BSC)·3w ago

A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.

Petter Sandås, Íñigo Aréjula-Aísa

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

Fei Bai +15·3w ago·also IQuest Research, RUC

ClawGym: A Scalable Framework for Building Effective Claw Agents

Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.

Fei Bai, Huatong Song, Shuang Sun +13

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks +1

UBA-CONICET·3w ago·also Universidad de Chile, Universidad de San Andrés

A Toolkit for Detecting Spurious Correlations in Speech Datasets

Discover hidden biases in your speech datasets: this toolkit uses non-speech audio to reveal spurious correlations that inflate performance metrics.

Lara Gauder, Pablo Riera, Andrea Slachevsky +3

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Speech & Audio

3w ago·also Gilbert AI Lab, USC

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.

Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou +8

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Saurabh K. Singh +2·3w ago

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

Document AI pipelines don't work the way you think: quality bottlenecks aren't where you expect, and component quality doesn't cascade downstream.

Saurabh K. Singh, Sachin Raj, S. Raj

Eval Frameworks & Benchmarks Multimodal Models Recommendation & Information Retrieval

Defense Language Institute Foreign·3w ago

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

LLMs exhibit surprising cross-lingual inconsistencies beyond simple translation errors, revealing divergences in cultural calibration, pragmatic disambiguation, and even institutional referral behavior.

Camelia Baluta

Eval Frameworks & Benchmarks Natural Language Processing

3w ago·also UChicago, UT Austin

Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

Despite recent advances, sign language translation models still struggle to leverage the full range of linguistic cues, especially non-manual signals like facial expressions.

Serpil Karabüklü, Kanishka Misra, Shester Gueuwou +3

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

Democracy Intelligence gGmbH·3w ago

When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

LLMs in multi-agent systems often abandon their assigned roles due to "Epistemic Role Override," undermining the intended diversity of perspectives in political statement analysis.

Juergen Dietrich

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Tsinghua AI·3w ago·also Fudan

CL-bench Life: Can Language Models Learn from Real-Life Context?

Today's best language models can barely make sense of your messy group chats and fragmented digital life, achieving only 19% accuracy on a new benchmark of real-world reasoning.

Shihan Dou, Yujiong Shen, Chenhao Huang +33

Eval Frameworks & Benchmarks Natural Language Processing

Jon-Paul Cacioli·3w ago

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Complex, multi-step instructions can cause LLMs to completely ignore question content and instead rely on positional shortcuts when asked to underperform, revealing a critical vulnerability in adversarial evaluation.

Jon-Paul Cacioli

Eval Frameworks & Benchmarks Open-Source Models & Weights Red-Teaming & Adversarial Robustness

3w ago·also Chongqing, SJTU

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.

Chaoxiang Xie, Yuling Shi, Wenhao Zeng +3

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks