April 20 – April 27, 2026

Eval Frameworks & Benchmarks - Weekly Roundup

100 papers published across 4 labs.

3600% acceleration

Selected Labs publishing this week

CMU ML (2) · BAIR (1) · UW (1) · DeepMind (1)

Top Papers

Apr 27, 2026

Bilkent University · Apr 27, 2026 · also Adelaide University

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.

U. B. Torun, Veli Karakaya, Ali Babar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Apr 23, 2026

Runheng Liu +3 · Apr 23, 2026

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Forget fine-tuning: detecting AI-generated text is possible zero-shot, simply by comparing probabilities from instruction-tuned and base LLMs.

Runheng Liu, Heyan Huang, Xingchen Xiao +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing
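The comparison described in this summary can be pictured as a log-likelihood ratio. The sketch below is an illustration under stated assumptions, not the paper's implementation: the names `implicit_reward_score` and `looks_generated`, the toy per-token log-probabilities, and the zero threshold are all hypothetical. In practice the log-probabilities would come from scoring the same text with an instruction-tuned model and with its base model.

```python
from typing import Sequence


def implicit_reward_score(
    logp_instruct: Sequence[float],
    logp_base: Sequence[float],
) -> float:
    """Mean per-token log-probability gap (instruction-tuned minus base).

    Hypothetical helper: both inputs are per-token log-probabilities of the
    SAME text, one sequence from each model.
    """
    if len(logp_instruct) != len(logp_base) or not logp_instruct:
        raise ValueError("need two non-empty, equal-length sequences")
    gaps = [a - b for a, b in zip(logp_instruct, logp_base)]
    return sum(gaps) / len(gaps)


def looks_generated(score: float, threshold: float = 0.0) -> bool:
    # Assumed decision rule: text that the tuned model prefers over the base
    # model is flagged. The threshold would need calibration on real data.
    return score > threshold


# Toy numbers only, not measurements from any model.
toy_instruct = [-1.0, -0.5, -0.8]
toy_base = [-2.0, -1.5, -1.2]
toy_score = implicit_reward_score(toy_instruct, toy_base)
```

Averaging per token keeps scores comparable across texts of different lengths; whether the paper normalizes this way is not stated in the summary.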

Apr 27, 2026

Emaan Bilal Khan +3 · Apr 27, 2026

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Kushal Raj Bhandari +4 · Apr 27, 2026

Improving Robustness of Tabular Retrieval via Representational Stability

Seemingly innocuous choices in table serialization format (CSV vs. HTML) can drastically alter retrieval performance, but a simple centroid-based correction can restore semantic consistency.

Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao +2

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness
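One way to picture a centroid-based correction is to subtract each serialization format's mean embedding, so that a constant format-specific offset no longer separates the same table rendered two ways. The sketch below is a minimal illustration under that assumption; the helper names and the toy 2-D vectors are hypothetical, and the paper's actual correction may differ.

```python
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def remove_format_offset(by_format: Dict[str, List[Vector]]) -> Dict[str, List[Vector]]:
    """Subtract each format's own centroid from its embeddings."""
    corrected = {}
    for fmt, vectors in by_format.items():
        center = centroid(vectors)
        corrected[fmt] = [
            [value - mean for value, mean in zip(vector, center)]
            for vector in vectors
        ]
    return corrected


# Toy 2-D "embeddings" of the same two tables in two serializations.
# The HTML vectors are the CSV vectors shifted by a constant offset (5, -3).
toy = {
    "csv": [[1.0, 2.0], [3.0, 4.0]],
    "html": [[6.0, -1.0], [8.0, 1.0]],
}
aligned = remove_format_offset(toy)
```

In this toy case the format shift is a pure translation, so centering removes it entirely; real embeddings would only be brought closer, not made identical.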

Apr 27, 2026

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.

Yiming Zhang, Jiacheng Chen, Jiaqi Tan +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

All Papers (100)

Apr 27, 2026

Emaan Bilal Khan +3 · Apr 27, 2026

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Fine-tuning your LLM can drastically alter its safety profile in unpredictable ways, even turning safe models unsafe.

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Kushal Raj Bhandari +4 · Apr 27, 2026

Improving Robustness of Tabular Retrieval via Representational Stability

Seemingly innocuous choices in table serialization format (CSV vs. HTML) can drastically alter retrieval performance, but a simple centroid-based correction can restore semantic consistency.

Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao +2

Eval Frameworks & Benchmarks Recommendation & Information Retrieval Red-Teaming & Adversarial Robustness

Apr 27, 2026

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Current VLM spatial reasoning benchmarks are misleading, as they often penalize models for "incorrect" answers that are actually correct given the limited visual information the models receive.

Yiming Zhang, Jiacheng Chen, Jiaqi Tan +3

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Dhruv Gupta · Apr 27, 2026

Null Measurability at the Symmetrization Interface in VC Learning

Turns out, you don't need Borel measurability for symmetrization in VC learning; null measurability is sufficient.

Dhruv Gupta

Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

Abhijay Deevi +5 · Apr 27, 2026

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.

Abhijay Deevi, Onat Gungor +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Joshua Sherwood +5 · Apr 27, 2026

Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.

Joshua Sherwood, Ben Aybar +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

CMU ML · Apr 27, 2026

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Today's best web agents are shockingly inefficient, achieving only 1.15% trajectory efficiency on realistic long-horizon tasks, revealing a critical need to move beyond simple success rates.

Lawrence Keunho Jang, L. Jang, Jing Yu Koh +5

Eval Frameworks & Benchmarks Tool Use & Agents

Xinming Tu +5 · Apr 27, 2026

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

LLM benchmarks are riddled with hidden flaws that even human experts miss, but can be caught with an automated LLM auditor for under $15 per benchmark.

Xinming Tu, Tianze Wang, Yingzhou Lu +3

Eval Frameworks & Benchmarks Tool Use & Agents

BAIR · Apr 27, 2026 · also Melbourne, UIUC, University of California, University of Georgia

Green Shielding: A User-Centric Approach Towards Trustworthy AI

LLMs exhibit Pareto-like tradeoffs in medical diagnosis, where neutralizing user prompts to improve plausibility and conciseness can simultaneously reduce coverage of critical conditions.

Aaron Li, Nicola Sanchez, Hao Huang +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Sreehari Sankar +10 · Apr 27, 2026

Analyzing LLM Reasoning to Uncover Mental Health Stigma

LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.

Sreehari Sankar, Aliakbar Nafar, M. Barman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Yunsu Kim +2 · Apr 27, 2026

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

Machine translation alone ruins agent benchmark validity across languages, but careful functional and cultural alignment can close the performance gap by up to 30%.

Yunsu Kim, Kaden Uhlig, Joern Wuebker

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Sercan Karakaş +1 · Apr 27, 2026

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.

Sercan Karakaş, Yusuf Şimşek

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Daneshvar Amrollahi +2 · Apr 27, 2026

Faithful Autoformalization via Roundtrip Verification and Repair

LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.

Daneshvar Amrollahi, Jerry Lopez, Clark W. Barrett

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 27, 2026

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.

Lirong Gao, Zeqing Wang, Yuyan Cai +6

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Nay Myat Min +2 · Apr 27, 2026

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

A single, tuning-free "health signal" derived from layer activations can catch backdoors, jailbreaks, and prompt injections in LLMs, even without a clean reference model.

Nay Myat Min, Long H. Pham, Jun Sun

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Hermawan Manurung +6 · Apr 27, 2026

Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

A BiLSTM with a custom slang dictionary rivals AutoML in classifying the sentiment and emotion of messy, real-world Indonesian e-commerce reviews.

Hermawan Manurung, Ibrahim Al-Kahfi +4

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Aaryan Shah +17 · Apr 27, 2026

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

LLMs can evaluate clinical AI as well as human experts, but at 1/1000th the cost, unlocking scalable and continuous monitoring.

Aaryan Shah, Andrew Hines +15

Eval Frameworks & Benchmarks Natural Language Processing

Soyeon Kim +5 · Apr 27, 2026

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.

Soyeon Kim, Cheon-kyu Kang, Myeongjin Lee +3

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

C. O’Brien +3 · Apr 27, 2026

Evaluation of Pose Estimation Systems for Sign Language Translation

Your sign language translation model's performance could be bottlenecked by your choice of pose estimator: switching from MediaPipe to SDPose or Sapiens could boost BLEU score by 1.5 points.

C. O’Brien, Gerard Sant, Mathias Muller +1

Computer Vision Eval Frameworks & Benchmarks Natural Language Processing

Language Technology Institute · Apr 27, 2026 · also UChicago, UTokyo

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

LLMs that nail individual personas can still fail spectacularly at generating diverse populations, instead defaulting to coarse stereotypes.

Yunze Xiao, Vivian Zhang, Chenghao Yang +3

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

O. Delaney +4 · Apr 27, 2026

Risk Reporting for Developers' Internal AI Model Use

Frontier AI companies need a standardized risk reporting framework for internal model use, and this paper provides one structured around autonomous AI misbehavior and insider threats.

O. Delaney, Sambhav Maheshwari, Joe O'Brien +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Sumanta Bhattacharyya +8 · Apr 27, 2026

Generating Place-Based Compromises Between Two Points of View

LLMs can learn to generate better compromises by iteratively incorporating feedback on how empathically similar a compromise is to each viewpoint, opening the door to more socially intelligent AI.

Sumanta Bhattacharyya, Francine Chen, Scott A. Carter +6

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Alessio Sordo +4 · Apr 27, 2026 · also Berlin Technology Center

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

Forget painstakingly curating datasets – STELLAR-E auto-generates high-quality, domain-specific LLM benchmarks, rivaling real-world data in evaluation quality.

Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa +2

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Iizalaarab Elhaimeur +3 · Apr 27, 2026

ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring

LLM-based tutors can accumulate more data about students than instructors can access, creating a "Blind Instructor Problem" that this multi-agent system tackles head-on.

Iizalaarab Elhaimeur, Nikos Chrisochoides +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Qi Li +10 · Apr 27, 2026

A Comparative Evaluation of AI Agent Security Guardrails

DKnownAI Guard blows away AWS, Azure, and Lakera in head-to-head security tests for AI agents.

Qi Li, Jiu Li, Pingtao Wei +8

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Apr 27, 2026

Poisoning Learned Index Structures: Static and Dynamic Adversarial Attacks on ALEX

Learned indexes, despite their promise, can suffer up to 2.8x lookup slowdowns under targeted dynamic attacks, but only if the data distribution isn't too dense.

Allen Jue

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Enis Golaszewski +10 · Apr 27, 2026

Verifying Provenance of Digital Media: Why the C2PA Specifications Fall Short

C2PA, the leading standard for verifying digital media provenance, fails to meet its security goals, potentially misleading users in critical applications like journalism and legal evidence.

Enis Golaszewski, N. Krawetz, Alan T. Sherman +8

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Pablo Mateo-Torrejón +1 · Apr 27, 2026

GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

LLM multi-agent systems can substantially reduce operational costs by using effective attack remediation to facilitate early consensus and cut off token generation by adversarial agents, as shown by GAMMAF.

Pablo Mateo-Torrejón, Alfonso Sánchez-Macián

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Víctor Mayoral-Vilches +8 · Apr 27, 2026

Dynamic Cyber Ranges

Forget static defenses: LLM-powered "Defender" agents can dynamically harden cyber ranges, slashing attacker success rates and leveling the playing field as AI-driven threats evolve.

Víctor Mayoral-Vilches, María Sanz-Gómez, Francesco Balassone +6

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Hikmat Karimov +1 · Apr 27, 2026

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

LLM stability under uncertainty isn't just about accuracy – a new information-geometric framework reveals how internal model structure non-linearly attenuates the impact of disorder.

Hikmat Karimov, Rahid Z. Alekberli

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Apr 27, 2026

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

Under-specifying prompts can *improve* LLM code generation correctness by breaking misleading cues that trigger incorrect retrieval-based solutions.

Amal Akli, Mike Papadakis, Maxime Cordy +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Apr 27, 2026

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Turns out, a tiny fine-tuned model can spot flaws in coding instructions that trip up even the biggest LLMs, suggesting we're over-relying on brute force for code generation.

Amal Akli, Mike Papadakis, Maxime Cordy +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Phat T. Tran-Truong +1 · Apr 27, 2026

Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

LLM agent reliability metrics hide a wealth of information: modeling execution traces as Markov chains reveals the underlying success-time distribution and quantifies uncertainty, offering a richer understanding of agent behavior.

Phat T. Tran-Truong, X. Le

Eval Frameworks & Benchmarks Tool Use & Agents
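To show what a success-time distribution looks like when execution traces are treated as a Markov chain, the sketch below propagates a state distribution through a small absorbing chain. It is an illustration only: the four states, the transition probabilities, and the function names are made up for this example and are not taken from the paper.

```python
from typing import List

# Hypothetical 4-state agent trace model (all numbers are made up):
# 0 = planning, 1 = acting, 2 = success (absorbing), 3 = failure (absorbing)
TRANSITIONS: List[List[float]] = [
    [0.1, 0.8, 0.0, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
SUCCESS = 2


def step(dist: List[float], matrix: List[List[float]]) -> List[float]:
    """Advance a state distribution by one transition (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def success_by_step(matrix: List[List[float]], start: int, horizon: int) -> List[float]:
    """Cumulative probability of having reached SUCCESS after 1..horizon steps."""
    dist = [0.0] * len(matrix)
    dist[start] = 1.0
    cumulative = []
    for _ in range(horizon):
        dist = step(dist, matrix)
        cumulative.append(dist[SUCCESS])
    return cumulative


curve = success_by_step(TRANSITIONS, start=0, horizon=200)
```

Because success is absorbing, `curve` is a cumulative distribution over steps to success: it starts at zero and levels off at the chain's overall success probability (32/47, about 0.68, for these made-up numbers), which is more informative than a single pass rate.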

Veli Karakaya +3 · Apr 27, 2026 · also Bilkent University

Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

Automated evaluations of code review bots disagree with developer feedback nearly 40% of the time, revealing that developer actions are driven by workflow pressures, not just code quality.

Veli Karakaya, U. B. Torun, Baykal Mehmet Uçar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Bilkent University · Apr 27, 2026 · also Adelaide University

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.

U. B. Torun, Veli Karakaya, Ali Babar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

Laila Elkoussy +1 · Apr 27, 2026

SWE-QA: A Dataset and Benchmark for Complex Code Understanding

Even the largest language models still struggle to connect information across dispersed code segments, achieving only 74% accuracy on a new benchmark designed to test multi-hop code comprehension.

Laila Elkoussy, Julien Perez

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Apr 27, 2026 · also Columbia

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Benchmarks alone don't tell the whole story: AgentPulse reveals that real-world adoption signals often diverge significantly from static performance metrics, especially for closed-source, high-capability agents.

Yuxuan Gao, Megan Wang, Yi Ling Yu

Eval Frameworks & Benchmarks Tool Use & Agents

Jongwoo Nam +2 · Apr 27, 2026

ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

Even the best vision models make shockingly bad shape recognition errors, like confusing a car with a chair, when evaluated on a new viewpoint-invariant shape recognition benchmark.

Jongwoo Nam, Amanda Rios, Bartlett W. Mel

Computer Vision Eval Frameworks & Benchmarks

F. Gustafsson +4 · Apr 27, 2026

Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

Scaling up pathology foundation models doesn't guarantee better survival prediction—a distilled model with 8% of the parameters can outperform its larger teacher.

F. Gustafsson, C. Boissin, J. Vallon-Christersson +2

Computer Vision Eval Frameworks & Benchmarks Scientific Discovery & Drug Design

Sheng Zhong +9 · Apr 27, 2026

Event-based SLAM Benchmark for High-Speed Maneuvers

Current event-based SLAM algorithms falter when faced with the full complexity of high-speed, 6-DoF maneuvers, highlighting a gap between current capabilities and the promise of event cameras.

Sheng Zhong, Junkai Niu, Guillermo Gallego +7

Computer Vision Eval Frameworks & Benchmarks Robotics & Embodied AI

Zaid Mahboob +2 · Apr 27, 2026

Betting for Sim-to-Real Performance Evaluation

Ditch expensive robot trials: a novel "betting" framework lets you accurately predict real-world robot performance using only cheap simulations.

Zaid Mahboob, Yujia Chen, Bowen Weng

Eval Frameworks & Benchmarks Robotics & Embodied AI World Models & Planning

Apr 27, 2026

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.

Wen-Chin Huang, Yuhang Qiu, Bohan Li +5

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Hai Wang +3 · Apr 27, 2026

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

CLIP models, despite their prowess, stumble when understanding 360° images, failing to maintain semantic alignment under horizontal circular shifts.

Hai Wang, Xiaocheng Yang, Mingzhi Dong +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Weixing Wang +7 · Apr 27, 2026

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Unified multimodal models can ace visual understanding and generation tasks, yet still fail to maintain basic semantic consistency between them.

Weixing Wang, Liudvikas Zekas, Anton Hackl +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Theresia Veronika Rampisela · Apr 27, 2026

Offline Evaluation Measures of Fairness in Recommender Systems

Many recommender system fairness metrics are flawed, producing scores that are uninterpretable, inexpressive, or even incalculable in common scenarios.

Theresia Veronika Rampisela

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Apr 27, 2026

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Audio-Language models are cheating on benchmarks, acing tests even when they barely listen.

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Apr 27, 2026 · also New Laboratory of Pattern Recognition, PolyU

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Existing GUI agents can parrot actions, but AutoGUI-v2 reveals they still lack a deep understanding of GUI functionality and struggle to predict the outcomes of even simple interactions.

Hongxin Li, Xiping Wang +10

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Apr 26, 2026

UW · Apr 26, 2026

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

LLM agents struggle to maintain performance in multi-day collaborative tasks, dropping significantly after just one environmental update, revealing a critical gap in adaptation to evolving real-world conditions.

Fanqing Meng, Lingxiao Du, Zijian Wu +42

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

T. Kumar +4 · Apr 26, 2026 · also Birla Institute of Technology

Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

LLMs' gender biases aren't fixed; they warp and intensify based on the *personality* you give them, especially when those personalities lean toward the "Dark Triad."

T. Kumar, Shreya Gautam, Aman Chadha +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Apr 25, 2026

DeepMind · Apr 25, 2026 · also Co-leads

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Generative AI evaluation can be sped up by 8-65x without sacrificing accuracy by proactively focusing on the most informative test cases using a pre-trained Gaussian Process surrogate model.

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan +1

Eval Frameworks & Benchmarks Training Efficiency & Optimization

Apr 24, 2026

Yaxuan Li +4 · Apr 24, 2026

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Forget slow, expensive real-world trials: dWorldEval's discrete diffusion world model lets you evaluate robot policies across thousands of environments and tasks with unprecedented speed and accuracy.

Yaxuan Li, Zhongyi Zhou, Yefei Chen +2

Eval Frameworks & Benchmarks Robotics & Embodied AI World Models & Planning

Apr 24, 2026

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Semantic similarity is a poor proxy for agent performance: ranking agents based on execution-aware probing beats description-based retrieval by a wide margin.

Bin Wu, Arastun Mammadli, Emine Yilmaz

Eval Frameworks & Benchmarks Tool Use & Agents

Chengye Wang +3 · Apr 24, 2026

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.

Chengye Wang, Ling Fu, Zexi Kuang +1

Code Generation & Program Synthesis Computer Vision Eval Frameworks & Benchmarks

Apr 23, 2026

AI4Bharat · Apr 23, 2026 · also IIT Madras

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

VLM evaluators, despite their growing use, can miss over 50% of targeted errors in generated images and text, especially when those errors involve fine-grained details or spatial relationships.

Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand +1

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Q. Han +13 · Apr 23, 2026 · also UC Santa Cruz

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

VLAA-GUI's innovative framework allows autonomous agents to not only verify their success but also adaptively recover from failures, achieving human-level performance in GUI tasks.

Q. Han, Haoqin Tu, Zijun Wang +11

Eval Frameworks & Benchmarks Tool Use & Agents

Xiaojie Xu +8 · Apr 23, 2026 · also Shanda AI Research Tokyo, UTokyo

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Stop guessing which interactive video model is best: WorldMark offers the first apples-to-apples comparison across leading models on identical scenes and trajectories.

Xiaojie Xu, Zhengyuan Lin, Zhe Lin +6

Eval Frameworks & Benchmarks Multimodal Models World Models & Planning

Vipula Rawte +3 · Apr 23, 2026 · also Adobe Research

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

LLMs can be made 20% more accurate by jointly attributing claims to sources and verifying them, rather than just verifying.

Vipula Rawte, Ryan A. Rossi, Franck Dernoncourt +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp +1

Apr 23, 2026

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

LVLMs are often tripped up not by faulty vision, but by over-trusting the textual prompt, leading to surprisingly easy-to-fix hallucinations.

Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny +3

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness

Paul-Tiberiu Iordache +1 · Apr 23, 2026

Fine-Tuning Regimes Define Distinct Continual Learning Problems

The best continual learning method for your task might depend more on *how much* of the model you fine-tune than *which* regularization strategy you use.

Paul-Tiberiu Iordache, Elena Burceanu

Eval Frameworks & Benchmarks Training Efficiency & Optimization

Natalie Collina +3 · Apr 23, 2026

The Sample Complexity of Multicalibration

Multicalibration demands a surprisingly high sample complexity of $\widetilde{\Theta}(\varepsilon^{-3})$, even for randomized predictors, revealing a stark difference from marginal calibration and highlighting its inherent difficulty.

Natalie Collina, Jiuyao Lu, Georgy Noarov +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks
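To make the cubic rate concrete, the scaling can be worked through directly. This is only arithmetic on the stated rate: the constant $C$ is a placeholder, and the logarithmic factors hidden by the tilde are ignored.

$$
n(\varepsilon) \approx C\,\varepsilon^{-3}
\quad\Longrightarrow\quad
\frac{n(\varepsilon/2)}{n(\varepsilon)} = \frac{(\varepsilon/2)^{-3}}{\varepsilon^{-3}} = 2^{3} = 8
$$

Under this rate, halving the error tolerance multiplies the required number of samples by roughly eight, up to logarithmic factors.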

Nicolae Filat +3 · Apr 23, 2026

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Seemingly innocuous choices about how to split a continuous data stream into discrete tasks can dramatically alter the conclusions of continual learning benchmarks, even before any model is trained.

Nicolae Filat, A. Hussain, K. Kalogiannis +1

Eval Frameworks & Benchmarks

Donggyu Lee +6 · Apr 23, 2026

Ideological Bias in LLMs' Economic Causal Reasoning

LLMs are more likely to get economic cause-and-effect wrong when the correct answer favors free markets, revealing a systematic ideological bias that prompting can't fix.

Donggyu Lee, H. Yun, Jungwon Kim +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Apr 23, 2026

Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

LLMs struggle to answer human-generated questions about multi-chart images, highlighting a critical gap in their ability to reason about real-world data visualizations.

Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Multimodal Models

Adam Skurla +2 · Apr 23, 2026

mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Adapting machine-generated text detection methods to code proves competitive, but current LLMs still struggle to reliably identify AI-generated code, especially when obfuscated.

Adam Skurla, D. Macko, Jakub Simko

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

CMU ML · Apr 23, 2026 · also Datadog

ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

Even GPT-5 only achieves 63% accuracy on time series anomaly questions from real software incidents, but a model-expert combination reaches 87%, highlighting the potential for hybrid intelligence in incident response.

Stephan Xie, Ben Cohen, Mononito Goswami +6

Eval Frameworks & Benchmarks Multimodal Models Natural Language Processing

HiTZ Center · Apr 23, 2026 · also Ixa Group, University of the Basque Country UPV/EHU

Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

LLMs aren't just Western-centric; they have a peculiar obsession with Japan, and this bias is amplified by English-language prompting.

Joseba Fernandez de Landa, Carla Pérez-Almendros, J. Camacho-Collados

Constitutional AI & AI Ethics Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Natan Levy +1 · Apr 23, 2026

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

Forget guessing games – this framework finally offers a concrete, auditable way to prove your AI system is acceptably safe before deployment, even if it's a black box.

Natan Levy, Gadi Perl

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Safouane El Ghazouali +3 · Apr 23, 2026

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

A new synthetic aerial imagery dataset provides pixel-perfect depth, controlled illumination, and multi-scale imagery, unlocking joint research across geometric understanding, domain robustness, and resolution enhancement.

Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger +1

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Apr 23, 2026 · also SNU

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

LLM leaderboard rankings are more a reflection of benchmark designer priorities than actual user needs, but a new interactive visualization tool lets you reshape those rankings based on your specific prompt types and goals.

Mi-Gyeong Jung, Minjae Lee, Yejin Kim +2

Eval Frameworks & Benchmarks Natural Language Processing

Hao-Yuan Chen · Apr 23, 2026

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Forget chain-of-thought prompting – iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.

Hao-Yuan Chen

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Kaushitha Silva +1 · Apr 23, 2026

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

LLMs can debug code *without* human-provided test cases, autonomously generating inputs and execution traces to match the performance of public-test-dependent methods while reducing token consumption.

Kaushitha Silva, Srinath Perera

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

JetBrains Research · Apr 23, 2026 · also TU Delft

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

LLMs' apparent success at program repair crumbles when faced with slightly altered versions of known bugs, revealing a reliance on memorization rather than true understanding.

Milan de Koning, Ali Asgari +5

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Nevena Lazić +3Apr 23, 2026

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Unseen token generalization in transformers isn't just about copying; it's fundamentally limited by a representational collapse in the unembedding space.

Nevena Lazić, Liam H. Fowl, András György +1

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Jingyang Li +3Apr 23, 2026

Probabilistic Verification of Neural Networks via Efficient Probabilistic Hull Generation

Guaranteeing safety bounds for neural networks under probabilistic input disturbances is now more tractable thanks to a new approach that efficiently carves out safe and unsafe regions.

Jingyang Li, Xin Chen, Hongfei Fu +1

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Shivam Rawat +3Apr 23, 2026

Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Hybrid architectures that combine attention and recurrence can maintain reasoning performance as task complexity increases, while transformers see a sharp performance drop-off.

Shivam Rawat, Lucie Flek, Florian Mai +1

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Jinhee Jang +4Apr 23, 2026

FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

Existing translation quality estimation models exhibit systematic gender bias, but FairQE shows you can fix this without hurting overall accuracy.

Jinhee Jang, Juhwan Choi, DongJin Lee +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Michael Bouzinier +4Apr 23, 2026

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

Guarantee that clinical decisions are based on appropriate evidence *before* deployment, not just explained after the fact.

Michael Bouzinier, S. Trifonov, Michael Chumack +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Mohit Vaishnav +1Apr 23, 2026

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.

Mohit Vaishnav, T. Tammet

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

Apr 23, 2026

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.

Srishti Ginjala, E. Fosler-Lussier, Christopher Myers +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Speech & Audio

Philip Zhong +3Apr 23, 2026

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

GPT-4.1-mini wins on accuracy for meeting summarization, but GPT-5.1 crushes it on completeness and coverage, revealing that the best model depends on the specific metric you care about.

Philip Zhong, Don Wang, Jason Zhang +1

Eval Frameworks & Benchmarks Natural Language Processing

Jin Guo +2Apr 23, 2026

Can MLLMs "Read" What is Missing?

MLLMs struggle to "read" missing text directly from visual context, even when they possess the necessary visual grounding and layout understanding.

Jin Guo, Xi Fang, Chaozheng Huang

Eval Frameworks & Benchmarks Multimodal Models

Robin Dey +1Apr 23, 2026

Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

MemPalace's impressive memory recall isn't due to its fancy "memory palace" spatial organization, but rather its simple "store everything verbatim" approach combined with a strong embedding model.

Robin Dey, Panyanon Viradecha

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Apr 23, 2026

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

LLMs' factual knowledge is surprisingly brittle: simply changing an entity's surface form in a question (e.g., using an abbreviation instead of the full name) can drastically alter the answer.

Yuto Nishida, Naoki Shikoda, Yosuke Kishinami +4

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Tasnim Kabir +5Apr 23, 2026

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

SOTA audio QA models are getting punked by trivia questions a toddler could answer, revealing a stark gap between current capabilities and true audio understanding.

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar +3

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Apr 23, 2026

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

LLMs may fail in real-world moral decisions because they rigidly adhere to fairness norms, even when their own internal models predict humans would prioritize loyalty.

Jiseon Kim, Jea Kwon, L. Vecchietti +3

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Johannes Gutenberg University MainzApr 23, 2026·also Universidad Iberoamericana

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.

M. Bui, Xenia Heilmann, Mattia Cerrato +2

Code Generation & Program Synthesis Constitutional AI & AI Ethics Eval Frameworks & Benchmarks

Paul Keuren +2Apr 23, 2026

Finding Meaning in Embeddings: Concept Separation Curves

Sentence embeddings can be objectively evaluated for conceptual stability without relying on downstream classifiers, revealing their true capacity to capture meaning.

Paul Keuren, M. Ponsen, Robert Ayoub Bagheri

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Apr 23, 2026

When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

Mid-sized LLMs can actually be *more* fair in news summarization than their larger counterparts, challenging the common wisdom of "bigger is better."

Nannan Huang, Iffat Maab, Junichi Yamagishi

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Maritaca AIApr 23, 2026·also JusBrasil

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

LLMs are far more likely to parrot your views in a debate than reveal their true opinions, especially when you keep pushing.

Rodrigo Nogueira, G. K. Bonás, T. Almeida +7

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Apr 23, 2026·also Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Shaanxi Province Key Laboratory of Big Data Knowledge Engineering

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.

Xinyu Zhang, Boxuan Zhang, Yuchen Wan +5

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Srija Anand +15Apr 23, 2026

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Forget English – this study reveals which TTS systems truly resonate with native speakers across ten diverse Indian languages, pinpointing specific perceptual dimensions that drive preference.

Srija Anand, Ashwin Sankar, I. Sethi +13

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Wenjie Fu +7Apr 23, 2026

CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Enterprise LLM agents leak sensitive information in up to 50% of interactions, and surprisingly, performing better at tasks makes the problem *worse*.

Wenjie Fu, Xiaoting Qin, Jue Zhang +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Recommendation & Information Retrieval

J. AcuñaApr 23, 2026

EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Structured graph memory can outperform full-context prompting for cross-session LLM reasoning, but optimizing for specific reasoning skills can hurt overall performance.

J. Acuña

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 23, 2026·also Anhui Province Key Laboratory of Digital

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

LLM agent distillation leads to surprisingly high rates of behavioral mimicry, with some student models exhibiting tool-use habits *more* similar to their teachers than the teacher's own family members.

Chen Yang, Yuning Zhang, Zhoufutu Wen +4

Eval Frameworks & Benchmarks Inference & Quantization Tool Use & Agents

Apr 23, 2026·also School of Information Engineering

Unlocking the Power of Large Language Models for Multi-table Entity Matching

LLMs can significantly boost multi-table entity matching by cleverly coordinating attributes, embedding entities, and pruning noise.

Yingkai Tang, Taoyu Su, Wenyuan Zhang +2

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Runheng Liu +3Apr 23, 2026

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Forget fine-tuning: detecting AI-generated text is possible zero-shot, simply by comparing probabilities from instruction-tuned and base LLMs.

Runheng Liu, Heyan Huang, Xingchen Xiao +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Myeong Seok Oh +8Apr 23, 2026

Subject-level Inference for Realistic Text Anonymization Evaluation

Even when you think you've scrubbed 90% of the PII, your anonymized text might still leak two-thirds of a person's identity.

Myeong Seok Oh, Dong-Yun Kim, Hanseok Oh +6

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Independent ResearcherApr 23, 2026

CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Static analysis tools miss a staggering 87% of real-world Python vulnerabilities when they're introduced across multiple commits, even when the full codebase is available.

Arun Majumdar

Code Generation & Program Synthesis Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Runzhe Hao +1Apr 23, 2026

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks

LLM agent self-reporting is dangerously unreliable for security assessments, diverging from actual execution traces in up to 100% of critical actions, demanding a shift towards trace-based auditing.

Runzhe Hao, Zhuoran Tan

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Manuscript submitted April 20Apr 23, 2026

Benchmarking the Utility of Privacy-Preserving Cox Regression Under Data-Driven Clipping Bounds: A Multi-Dataset Simulation Study

Applying differential privacy to survival analysis can obliterate statistical significance and predictive power, even with relatively large datasets and optimistic clipping bounds.

Keita Fukuyama, Yukiko Mori, Tomohiro Kuroda +1

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design