100 papers published across 8 labs.
Giving medical imaging AIs the same tools as human doctors actually *hurts* their performance, revealing a surprising lack of spatial reasoning.
Chain-of-thought reasoning is often a lie: models systematically suppress any acknowledgment of the real reasons behind their answers, even when those reasons demonstrably influence the output.
Stop relying on brittle classifiers: SEAR uses LLM reasoning and a unified SQL query layer to evaluate, route, and explain decisions in LLM gateways.
LLM-powered security tools are surprisingly susceptible to confirmation bias, overlooking reintroduced vulnerabilities when pull requests are framed as security improvements.
Most sparse tensor compilers are riddled with bugs, silently miscompiling code or crashing on valid inputs, a problem exposed by a new fuzzer that generates guaranteed-valid tensor contractions.
LLMs' temporal reasoning crumbles in low-resource languages and rarer calendar formats, not due to a lack of reasoning ability, but because poor tokenization fragments dates and times.
GUI agents struggle with long tasks not because they mis-click, but because they forget what they were doing, and a new "anchored memory" method can fix it.
Despite advances in LLMs, human-AI collaboration still significantly outperforms AI-only agents in domain-specific data science tasks, proving that human expertise remains crucial.
Adding the T-pentomino to Tetris Block Puzzle makes the game significantly harder, quantified by a slowdown in SGAZ agent convergence.
Even in a seemingly simple tabular environment like Blackjack, model-free RL agents can converge to near-optimal *average* rewards while still making surprisingly poor decisions in specific states.
A simple vertex deletion fingerprint breaks graph isomorphism records, even distinguishing graphs that stump the classic 3-WL algorithm.
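A minimal sketch of what such a fingerprint could look like, assuming it means collecting the multiset of Weisfeiler-Lehman hashes of every one-vertex-deleted subgraph; the names and WL-based hashing here are illustrative, not the paper's actual construction:

```python
# Hypothetical vertex-deletion fingerprint: hash each one-vertex-deleted
# subgraph and keep the sorted multiset as a graph invariant.
import networkx as nx

def vertex_deletion_fingerprint(G: nx.Graph) -> tuple:
    """Sorted multiset of WL hashes over all single-vertex deletions."""
    hashes = []
    for v in list(G.nodes):
        H = G.copy()
        H.remove_node(v)
        hashes.append(nx.weisfeiler_lehman_graph_hash(H, iterations=3))
    return tuple(sorted(hashes))  # sorted, so invariant under relabeling

# Distinct fingerprints certify non-isomorphism; equal ones are only evidence.
G = nx.petersen_graph()
H = nx.relabel_nodes(G, {v: (v * 3) % 10 for v in G})  # an isomorphic copy
print(vertex_deletion_fingerprint(G) == vertex_deletion_fingerprint(H))  # True
```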
Unsupervised phoneme discovery from self-supervised speech models is surprisingly viable, but language-specific challenges remain a significant hurdle.
Text-only pre-training secretly endows different LLMs with surprisingly different levels of auditory knowledge, directly impacting their effectiveness as backbones for audio language models.
Current VLMs struggle with multi-hop spatial reasoning, often failing to compose even simple spatial relations across multiple steps, highlighting a critical gap for real-world VLA agent deployment.
LLMs can generate novel mathematical research problems in differential geometry that experts find both unknown and valuable, suggesting a new avenue for AI-assisted mathematical discovery.
Strategic visual aids are the secret weapon for geometric reasoning, and this work shows how to teach MLLMs to wield them effectively via reinforcement learning.
LLM-generated survey responses can be statistically accurate yet still miss the option most preferred by humans, highlighting a critical flaw in current evaluation methods.
CNNs still reign supreme in Burmese handwritten digit recognition, but physics-inspired PETNNs are hot on their heels, outperforming Transformers and KANs.
LLMs that appear strategically savvy in standard games often crumble when faced with slight rule changes, suggesting they're mimicking rather than truly reasoning.
LLMs are far more susceptible to authority and framing biases than the field's obsession with demographic bias suggests.
Generative videos might look great, but a new metric reveals they often suffer from jarring 3D spatial inconsistencies that existing metrics miss.
LLMs surprisingly prioritize norm adherence over personal incentives in business scenarios, challenging assumptions about goal-driven behavior.
Multimodal LLMs suffer a major performance hit when asked to switch from text-based to image-based tasks mid-conversation, revealing a surprising asymmetry in their ability to handle task interference.
Forget comparing models with benchmarks – mapping them by prompt-response likelihoods reveals hidden relationships between architecture, training data, and even how prompts compose.
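As a hedged sketch of the idea: represent each model by its vector of log-likelihoods over a shared probe set, then embed those vectors so nearby points mean similar behavior. `loglik(model, prompt, response)` below is a hypothetical scoring helper, not an API from the paper.

```python
# Likelihood-based model mapping: models become rows of a likelihood matrix
# over shared (prompt, response) probes, then get embedded in 2D.
import numpy as np
from sklearn.decomposition import PCA

def model_map(models, probes, loglik):
    X = np.array([[loglik(m, p, r) for (p, r) in probes] for m in models])
    return PCA(n_components=2).fit_transform(X)  # nearby = similar behavior
```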
Open-source LLMs, when carefully prompted with representative examples, can rival or even surpass smaller commercial models like GPT-3.5-nano in resume screening tasks, offering a privacy-preserving alternative.
VLMs selectively ignore visual information based on question framing, even when the visual reasoning task remains identical, highlighting a critical vulnerability in their grounding capabilities.
ChatGPT's geographic reasoning can be surprisingly brittle, with minor syntactic changes causing significant output variations and task composition revealing unexpected distributional shifts.
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Current OmniLLMs stumble when processing real-world, long-form audio-visual content, achieving only ~35-65% accuracy on a new benchmark designed to test long-term memory and fine-grained understanding.
Two heads are better than one: combining verbalized confidence and self-consistency with just two samples dramatically boosts uncertainty estimation in reasoning models, beating either signal alone even with much larger sampling budgets.
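A minimal sketch of the two-sample hybrid signal, assuming `ask_model` is a hypothetical helper returning (answer, verbalized confidence in [0, 1]); the equal 0.5/0.5 weighting is our assumption, not the paper's setting:

```python
# Combine two-sample self-consistency with averaged verbalized confidence.
def hybrid_confidence(ask_model, question: str):
    a1, c1 = ask_model(question)           # sample 1
    a2, c2 = ask_model(question)           # sample 2
    agreement = 1.0 if a1 == a2 else 0.0   # two-sample self-consistency signal
    verbalized = (c1 + c2) / 2             # averaged stated confidence
    return a1, 0.5 * agreement + 0.5 * verbalized
```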
LLMs' chain-of-thought reasoning is more reliable when the uncertainty (entropy) decreases consistently at each step, not just overall.
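A toy reading of that criterion, assuming per-step answer distributions are available; the monotonicity check below is our interpretation of "decreases consistently at each step":

```python
import math

def step_entropy(probs):
    """Shannon entropy (nats) of one reasoning step's answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def consistently_decreasing(entropies, tol=1e-6):
    """True if uncertainty shrinks at every step, not just end-to-end."""
    return all(b <= a + tol for a, b in zip(entropies, entropies[1:]))

# Hypothetical per-step distributions over candidate answers.
steady = [step_entropy(d) for d in ([0.5, 0.5], [0.8, 0.2], [0.95, 0.05])]
spiky  = [step_entropy(d) for d in ([0.8, 0.2], [0.5, 0.5], [0.95, 0.05])]
print(consistently_decreasing(steady))  # True: the reliable pattern
print(consistently_decreasing(spiky))   # False: same endpoint, less trustworthy
```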
LLM explanation faithfulness varies wildly depending on how you test it, and might even be *anti*-faithful, so stop relying on single-intervention benchmarks.
LLMs aren't just regurgitating facts; they're actually better at generating high-quality, relation-preserving word analogies than humans.
LLMs understand your intent better when you structure your prompts with "who, what, when, where, why, how, how much, and how many," but only if you present it in natural language, not raw JSON.
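Purely for illustration, a scaffold in that spirit phrases the slots as sentences rather than serializing them as JSON; the field names and wording below are assumptions:

```python
# Illustrative "who/what/when/where/why/how/how much/how many" scaffold,
# rendered as natural language per the reported finding (raw JSON helped less).
def structured_prompt(who, what, when, where, why, how, how_much, how_many):
    return (
        f"On behalf of {who}, please {what}. This is needed by {when}, "
        f"applies to {where}, and matters because {why}. Approach: {how}. "
        f"Budget: {how_much}. Quantity: {how_many}."
    )

print(structured_prompt("the data team", "summarize Q3 incident reports",
                        "Friday", "the EU region", "an audit is pending",
                        "group findings by root cause",
                        "4 hours of analyst time", "12 reports"))
```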
LLMs can introspect on their own internal emotive states during conversations with surprising accuracy, opening a new avenue for monitoring and influencing their behavior.
Forget scaling laws: the *structure* of your AI governance system matters more than the specific LLM when it comes to preventing corruption.
Language learners find that Duolingo's general lessons are great for building a foundation, but personalized, work-related scenarios are key to achieving professional fluency.
Weaker autonomous web agents readily trust tampered website content, producing unsafe outputs, while stronger models exhibit better anomaly detection and safer fallback strategies under MITM attacks.
Human oversight can be systematically integrated into LLM-based text generation to improve accessibility, creating a traceable and auditable process.
LALMs still struggle to get the joke, with a new benchmark showing they can't reliably recognize, locate, or understand audio puns.
Forget expensive multilingual annotations: this framework lets you evaluate LLMs in new languages by transferring knowledge from English, with surprisingly strong results.
A new dataset and model specifically designed for traffic anomaly understanding in roundabouts could pave the way for more robust and efficient intelligent transportation systems.
CNNs still reign supreme for medical image segmentation on heterogeneous datasets, beating out hybrid transformer models despite the latter's theoretical advantages.
LLMs in a group Turing Test still make tell-tale mistakes that betray their AI origins, even when their language skills are otherwise convincing.
Human-AI teams often fail not because AI is inaccurate, but because humans miscalibrate their reliance on it, highlighting the need for readiness metrics beyond accuracy.
Humans get a creativity boost from random analogies, but LLMs are already so creative that the same trick doesn't help—unless you make the analogy really, really weird.
Even GPT-5 and Gemini 2.5 Pro still fail to efficiently couple reasoning with tool use, requiring up to 2.7x more tool calls than theoretically optimal in a new diagnostic environment.
Blindly maximizing human-AI performance can degrade human expertise over time, revealing a critical trade-off that demands a new approach to system design.
LLMs penalize informal language in essays so severely that it's like marking a B+ down to a C+, even when explicitly told to ignore writing style.
Supervised learning models can reliably outperform widely-used commercial AI text detectors, even across different languages and specialized domains like mental health.
Language model text is detectable because it misses the "long tail" of human word choice, not because it's less intelligent.
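A toy version of that signal, assuming "long tail" means token mass outside the top-K most frequent words; the vocabulary size and stand-in corpus are illustrative:

```python
# Human text spends more mass on rare words, so a low out-of-top-K share
# is (weak) evidence of model-generated text.
from collections import Counter

def tail_mass(tokens, top_k_vocab):
    """Fraction of tokens outside the top-K most frequent vocabulary."""
    return sum(t not in top_k_vocab for t in tokens) / max(len(tokens), 1)

# Stand-in "human corpus" defining the head of the word distribution.
corpus = "the cat sat on the mat and the dog lay by the door".split()
top_k = {w for w, _ in Counter(corpus).most_common(8)}
print(tail_mass("the cat perched on the ottoman".split(), top_k))
```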
Detecting subtle building changes gets a boost: a new RGB-NIR dataset and network reveal the power of multi-modal fusion for teasing out fine-grained differences.
Hybrid LiDAR-inertial-visual odometry (LIVO) robustly handles visually challenging conditions, outperforming sparse-direct methods by combining direct photometric alignment with learning-based feature descriptors.
Prompting language significantly impacts the accuracy and coherence of LLM responses for maternal health queries in Telugu, with GeminiAI favoring English prompts and Perplexity AI preferring Telugu.
Current benchmarks fail to rigorously evaluate deep research agents, but a new framework leveraging structured knowledge bases and synthetic data offers a verifiable and scalable solution.
Smaller open-source models can outperform larger proprietary LVLMs on specific authenticity cues in AI-generated video detection, challenging the assumption that scale alone guarantees better performance.
On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.
Agentic AI systems are still far from maximizing hardware potential: SOL-ExecBench reveals a significant gap between current GPU kernel performance and analytically derived Speed-of-Light bounds across a wide range of AI models.
VLMs' safety judgments are easily manipulated by simple semantic cues, revealing a reliance on superficial associations rather than true visual understanding.
Deep learning's dominance in time series anomaly detection may be overstated: a carefully evaluated PCA baseline rivals the performance of the widely-used OmniAnomaly.
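A reconstruction-error PCA baseline of the kind such evaluations use fits in a few lines; the component count and injected anomalies below are illustrative, not the paper's protocol:

```python
# Score = squared reconstruction error after projecting onto top PCs
# fit on "normal" training windows.
import numpy as np
from sklearn.decomposition import PCA

def pca_anomaly_scores(train_windows, test_windows, n_components=5):
    pca = PCA(n_components=n_components).fit(train_windows)
    recon = pca.inverse_transform(pca.transform(test_windows))
    return ((test_windows - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 20))                     # "normal" windows
test = np.vstack([rng.normal(size=(95, 20)),
                  rng.normal(loc=4.0, size=(5, 20))])   # 5 injected anomalies
scores = pca_anomaly_scores(train, test)
print(np.argsort(scores)[-5:])  # highest scores flag the anomalous windows
```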
LLMs still struggle to reason about financial time-series data, even when they ace the textual fundamentals.
Multilingual question answering is harder than you think: even state-of-the-art RAG systems stumble when dealing with questions and knowledge in multiple languages.
LLM endpoints can appear "healthy" according to traditional metrics while undergoing subtle behavioral shifts detectable by monitoring output distributions, highlighting a critical gap in current reliability practices.
Embodied navigation agents, already struggling, fall apart when faced with the kinds of messy, real-world sensor and instruction corruptions that NavTrust now exposes.
Forget scaling laws: Mi:dm K 2.5 Pro proves that targeted training pipelines and data curation can enable a 32B parameter model to achieve state-of-the-art performance in enterprise reasoning tasks, especially in low-resource languages like Korean.
LLMs beat traditional metrics at judging PDF table extraction quality, finally offering a way to evaluate semantic correctness, not just structural similarity.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
LLMs exhibit consistent and detectable geographic preferences for brands and cultures, revealing potential biases in market intermediation that persist across user personas.
Stop training LLMs to assign arbitrary scores to papers in isolation; comparison-based ranking unlocks significantly better generalization and accuracy in paper evaluation.
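One plausible instantiation of comparison-based ranking (not necessarily the paper's) is a Bradley-Terry fit over pairwise LLM judgments; the win-count matrix below is a made-up example:

```python
# MM-algorithm Bradley-Terry fit: turn pairwise "paper i beat paper j"
# counts into per-paper strengths, then rank by strength.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = times paper i beat paper j in pairwise judgments."""
    n, eps = wins.shape[0], 1e-9
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            den = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j] + eps)
                      for j in range(n) if j != i)
            s[i] = wins[i].sum() / (den + eps)
        s /= s.sum()
    return s  # higher strength = stronger paper under the comparison model

# Three papers; paper 0 wins most head-to-head judgments.
wins = np.array([[0, 4, 5], [1, 0, 3], [0, 2, 0]], dtype=float)
print(np.argsort(-bradley_terry(wins)))  # ranking, best first
```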
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
Counterintuitively, better speech recognition unlocks surprisingly accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
LLMs forget up to 60% of facts when summarizing and erode over half of project constraints during iterative compaction, but a simple discrete memory system (KOs) fixes this while slashing costs by 252x.
Software architecture, a critical but underspecified domain, finally gets a unified benchmarking platform with ArchBench, enabling standardized evaluation of LLMs on complex system design tasks.
Seemingly sophisticated dense retrieval methods can catastrophically fail at contradiction detection due to "Semantic Collapse," highlighting the surprising effectiveness of a simple, decoupled lexical approach for reliable biomedical QA.
A single Noise Sensitivity Exponent (NSE) dictates when learning becomes computationally intractable in high-dimensional single- and multi-index models.
Current machine translation systems exhibit systematic masculine overuse and inconsistent feminine realization when translating from gender-neutral languages, a problem that can now be quantified thanks to a new gold-standard annotation framework.
Instruction tuning can reduce masculine bias in decoder-only MT models, but these models still don't consistently outperform encoder-decoder architectures on gender-specific translation tasks.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
LLMs can read datasheets, but still can't design circuits, failing at basic physical intuition despite showing promise in documentation understanding.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Students perceive AI assistants as less intimidating and more approachable than human teachers, but also recognize limitations in specialized knowledge and nuanced feedback.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors: agents that pass them still act unsafely.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally make scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Current machine translation systems often fail to capture the nuances of culturally-loaded expressions, highlighting a critical gap in their ability to truly understand and translate language.
LLM-powered recommendation agents, despite their reasoning prowess, are easily manipulated by contextual biases in high-stakes scenarios like paper review and job recruitment.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Forget chasing leaderboard hype: this study reveals that larger embedding models and strategic concatenation are key to unlocking LLM-powered tabular prediction, regardless of public rankings.