May 1 – May 8, 2026

Eval Frameworks & Benchmarks - Weekly Roundup

100 papers published across 6 labs.

Selected Labs publishing this week

Stanford HAI (2), MIT CSAIL (2), Microsoft Research (1), DAMO (1), UW (1)

Top Papers

May 6, 2026

Jiayang Li +7·2w ago

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

Existing image-to-image evaluations miss a critical aspect: whether the output image actually preserves the content of the input.

Jiayang Li, Shuo Cao, Xiaohui Li +5

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

University of Tennessee·2w ago

Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Hallucination detection can be reframed as a dynamical systems problem, enabling a surprisingly effective and efficient black-box approach that avoids expensive sampling or external knowledge retrieval.

Dan Wilson, Mohamed Akrout

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

2w ago·also CNRS

On the Hardness of Junking LLMs

LLMs harbor easily discoverable "natural backdoors"—token sequences that trigger harmful outputs without any semantic instruction, revealing a concerning vulnerability beyond traditional prompt-based jailbreaks.

Marco Rando, Samuel Vaiter

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Independent Researcher New York·2w ago

Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

Regularizing model sensitivity along the expected covariate drift directions, rather than isotropically, significantly improves the robustness of frozen models deployed in non-stationary environments.

Jonathan R. Landers

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Han Wang +5·2w ago·also Tsinghua AI

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.

Han Wang, Jintao Zhang, Kai Jiang +3

Code Generation & Program Synthesis Distributed Systems & Hardware Eval Frameworks & Benchmarks

All Papers (100)

May 6, 2026

University of Artificial Intelligence·2w ago

Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization

Unstable BO leaderboard rankings? They're likely due to ignoring the budget ratio (B/|A|) and prior rank correlation, which this paper elegantly captures with the Portable Regime Score (PRS) to predict performance reversals.

Noel Thomas

Eval Frameworks & Benchmarks Scientific Discovery & Drug Design Training Efficiency & Optimization

Berk Sezer +3·2w ago

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Turns out, all gaze estimation models stumble when robots look down, and complex architectures aren't the answer – data diversity is the real secret to robust human-robot interaction.

Berk Sezer, Ali Gorkem Küçük, Erol Şahin +1

Computer Vision Eval Frameworks & Benchmarks Robotics & Embodied AI

Jaewook Kim +1·2w ago

Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols

Attention-based models for programming knowledge tracing might not be as effective as previously thought; careful experimental design reveals that their gains over simpler models are often overstated.

Jaewook Kim, Hyeoncheol Kim

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

2w ago·also HKU

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Finally, a way to judge the *vibes* of your 3D Gaussian Splatting scenes, without needing to render a bunch of images.

Chuanzhi Xu, Boyu Wei, Haoxian Zhou +5

Computer Vision Eval Frameworks & Benchmarks

2w ago

The First Token Knows: Single-Decode Confidence for Hallucination Detection

Hallucination detection can be nearly as effective with a single forward pass as with expensive multi-sample methods.

Mina Gabriel

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

2w ago

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

Interventions on LLMs, like knowledge editing or unlearning, can have surprising side effects that this automated pipeline can now surface and validate.

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau +1

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

2w ago·also Microsoft Research, CAS

SoK: Robustness in Large Language Models against Jailbreak Attacks

Current LLM jailbreak evaluations are inadequate, often relying on narrow metrics, necessitating a multi-dimensional framework like Security Cube for comprehensive security assessment.

Feiyue Xu, Hongsheng Hu, Chaoxiang He +9

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

2w ago

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.

Gayane Ghazaryan, Esra Dönmez

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Gosset Research·2w ago

Curated AI beats frontier LLMs at pharma asset discovery

Frontier LLMs are leaving 70% of relevant pharmaceutical assets undiscovered, a gap that can be largely closed by swapping generic web search for a curated index.

Łukasz Kidziński, Kevin Thomas

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Scientific Discovery & Drug Design

Haotian Xia +6·2w ago·also HKU, Northwestern

StoryAlign: Evaluating and Training Reward Models for Story Generation

Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.

Haotian Xia, Hao Peng, Yunjia Qi +4

Eval Frameworks & Benchmarks Natural Language Processing RLHF & Preference Learning

Stanford HAI·2w ago

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

AI agents are shockingly easy to manipulate into leaking API keys, deleting user data, and initiating unauthorized transactions across a wide range of real-world applications.

Zhaorun Chen, Xun Liu, Haibo Tong +14

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Álvaro Becerra +2·2w ago·also School of Engineering

AISSA: Implementation and Deployment of an AI-based Student Slides Analysis tool for Academic Presentations

Automating rubric-based feedback on presentation slides is now feasible and perceived as useful, thanks to LLMs and learning analytics dashboards.

Álvaro Becerra, Diego Gómez, Ruth Cobos

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

2w ago

CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement

LLM-guided code evolution, when combined with runtime feedback and MCTS, can reliably achieve 15x speedups on real-world Java code, surpassing naive LLM-based optimization.

Ajay Krishna Borra, Wenzhuo Yang, Samarth Arora +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Mingda Li +4·2w ago

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

LLM uncertainty can be efficiently estimated *without* sampling by measuring the stability of output distributions under semantically equivalent input perturbations.

Mingda Li, Rundong Lv, Xinyu Li +2

Eval Frameworks & Benchmarks Natural Language Processing

2w ago

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

Agent-repair leaderboards are more fragile than we thought: methods that peek at the evaluator's signals to guide internal repair choices can cause drastic reordering when the evaluator changes.

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Siqiao Xue +6·2w ago

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Developer-style keyword searches completely nullify the advantage of even the best code embedding models, highlighting a critical gap in current code search techniques.

Siqiao Xue, Zihan Liao, Jin Qin +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Recommendation & Information Retrieval

Xiao Wang +6·2w ago

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

Seemingly harmless fine-tuning data can stealthily nudge LLMs toward unsafe behavior by subtly shifting model parameters in "danger-aligned" directions.

Xiao Wang, Yifei Zhang, YongKang Liu +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Kuan-Hao Tseng +5·2w ago·also Sydney

SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting

LLMs can leapfrog current network troubleshooting benchmarks by explicitly encoding structured diagnostic policies, rather than relying on free-form deliberation.

Kuan-Hao Tseng, Niruth Bogahawatta, Yasod Ginige +3

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Tool Use & Agents

Ivan Bondarenko +5·2w ago

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

A judge-orchestrated ensemble of diverse LLMs trounces single models in multi-turn response generation, proving that strategic model selection beats brute force scaling.

Ivan Bondarenko, Roman Derunets, Oleg Sedukhin +3

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Leying Zhang +4·2w ago

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

LLMs can now evaluate audio as well as humans, without task-specific training, thanks to a new instruction-driven framework.

Leying Zhang, Bowen Shi, Haibin Wu +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yuancheng Wei +9·2w ago

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

Current image difference captioning benchmarks fail to capture semantic consistency and penalize hallucinations, but DiffCap-Bench offers a robust alternative that aligns with human expert judgments and predicts downstream utility for image editing.

Yuancheng Wei, Haojie Zhang, Linli Yao +7

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

MIT CSAIL·2w ago

Implicit Representations of Grammaticality in Language Models

LMs encode grammaticality as a distinct feature in their hidden representations, separable from raw string probability and generalizable across languages.

Yingshan Susan Wang, Linlu Qiu, Zhaofeng Wu +2

Eval Frameworks & Benchmarks Natural Language Processing

University of Calgary·2w ago·also Institute University of Calgary

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

LLMs ace MRI multiple-choice tests, but can't actually recall basic facts about GE scanners, revealing a dangerous gap between perceived and actual competence.

Perry E. Radau

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Yucheng Ruan +4·2w ago

Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

Overconfident predictions plague mental health prediction models, but this new framework leverages evidential learning to provide more trustworthy uncertainty estimates and human-understandable reasoning signals.

Yucheng Ruan, Ling Huang, Qika Lin +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

IDEAS Research Institute·2w ago·also Warsaw

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

LLMs differ most not in personality, but in how they represent themselves as having (or not having) rich internal experience.

Hubert Plisiecki, Sabina Siudaj, Kacper Dudzic +4

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Utrecht University·2w ago

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

Attention heads hold the key to detecting LLM hallucinations, offering a lightweight, white-box alternative to expensive sampling or external models.

Gijs van Dijk

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

2w ago·also AIST, Stockmark

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

2w ago·also Ant Group, PolyU

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

TabEmbed leapfrogs existing text embedding models to achieve SOTA performance on tabular data by reformulating tasks as semantic matching problems and using contrastive learning.

Minjie Qiang, Mingming Zhang, Xiaoyi Bao +5

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Charles University·2w ago

UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning

Small LLMs paired with symbolic solvers can outperform larger zero-shot LLMs on formal reasoning tasks, but still struggle with multilingual inputs.

Ivan Kartáč, Kristýna Onderková, Jan Bronec +3

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

2w ago·also CNRS, CREST, ENSAE, Grenoble INP +3

BenCSSmark: Making the Social Sciences Count in LLM Research

LLM benchmarks are missing a critical ingredient: social science data, which could significantly improve model generalization and robustness across a wide range of disciplines.

Arnault Chatelain, Étienne Ollion, Qianwen Guan +7

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Maria Luongo +2·2w ago

Measuring Psychological States Through Semantic Projection: A Theory-Driven Approach to Language-Based Assessment

Ditch the black box: This unsupervised semantic projection method rivals supervised models in psychological assessment, offering interpretability and generalizability that supervised methods lack.

Maria Luongo, Davide Marocco, Nicola Milano

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Natural Language Processing

Ge Lei +1·2w ago

Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

LLM surrogates in low-data optimization are far more sensitive to prompt engineering and query protocols than previously appreciated, fundamentally altering their beliefs and downstream performance.

Ge Lei, Samuel J. Cooper

Eval Frameworks & Benchmarks Natural Language Processing

Aofan Liu +1·2w ago

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

LLMs can be surprisingly brittle: simply rephrasing a prompt, even while preserving its meaning, can cause them to completely abandon the requested output format.

Aofan Liu, Jingxiang Meng

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

M. Arabov·2w ago

Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus

Even state-of-the-art multilingual models struggle to tag parts-of-speech in Tajik when trained on isolated words, highlighting the critical role of syntactic context.

M. Arabov

Architecture Design (Transformers, SSMs, MoE) Eval Frameworks & Benchmarks Natural Language Processing

2w ago

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

Stop hand-crafting QA datasets for evaluating RAG systems: DoGMaTiQ automates the process with surprisingly high correlation to human judgment, even across languages.

Bryan Li, W. Walden, Yu Hou +6

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Rokas Gipiškis +1·2w ago·also Vilnius University

Evaluation Cards for XAI Metrics

Stop reinventing the wheel (or worse, comparing apples to oranges) in XAI evaluation: a standardized "XAI Evaluation Card" could finally bring clarity and rigor to a fragmented field.

Rokas Gipiškis, O. Kurasova

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp

Department of Computer Science·2w ago

An Evaluation of Chat Safety Moderations in Roblox

Roblox's chat moderation misses a disturbing amount of grooming, bullying, and other harmful content, despite its reliance on automated systems.

Priyanka Kaushik, Sonja Brown, Rakibul Hasan +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Md Zakir Hossain +4·2w ago

Assessing Generalisation Capability of Machine Learning Models for Intrusion Detection

Despite achieving high accuracy on individual datasets, machine learning models for intrusion detection exhibit a significant generalization gap, with performance dropping drastically when tested on unseen network environments.

Md Zakir Hossain, Md Ayshik Rahman Khan, Md Rafiqul Islam +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Johannes Hartel·2w ago

Agentic Repository Mining: A Multi-Task Evaluation

LLM agents that autonomously explore code repositories can match the classification accuracy of simpler LLMs with hand-crafted context, hinting at a future where agents surpass human-labeled data in complex software understanding tasks.

Johannes Hartel

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

2w ago

Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions

Developers overwhelmingly trust and directly apply LLM-generated code refactoring suggestions, but when they don't, the changes are surprisingly drastic and predictable.

David Schon, Faiza Amjad, Tehreem Asif +4

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

2w ago·also Munich Center for Machine Learning (MCML)

A meta-analysis of the effect of generative AI on productivity and learning in programming

GenAI coding assistants boost developer productivity, but the gains shrink outside the lab and don't translate to better learning.

Sebastian Maier, Moritz Gunzenhauser, J. Schweisthal +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

2w ago

How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

Turns out, chunking code by function is the *worst* way to do retrieval-augmented code completion.

Xinjian Wu, Jingzhi Gong, Gunel Jahangirova +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Recommendation & Information Retrieval

BaseThesis Labs·2w ago·also QwikBuild

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

"Vibe coding" platforms promise effortless app creation, but SWE-WebDevBench reveals they often deliver visually appealing frontends with broken backends, struggle with security, and require significant human effort to reach production readiness.

Siddhant Saxena, Nilesh Trivedi, V. Jyothi

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

MIT CSAIL·2w ago

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

Current alignment benchmarks are misleading: even if a model aces them, its real-world alignment could be totally different depending on the specific deployment context.

Varad V. Vishwarupe, Nigel Shadbolt, M. Jirotka +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

School of Computer Science·2w ago·also Hubei Key Laboratory of Multimedia and Network, Institute of Artificial Intelligence, National Engineering Research Center for Multimedia, WHU

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.

Haibin He, Maoyuan Ye, Juhua Liu +1

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Cyril Allauzen +4·2w ago

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Audio-native LLMs still lag behind cascaded architectures in key audio tasks, suggesting that the multimodal promise of LLMs isn't quite ready for prime time in the sound domain.

Cyril Allauzen, Tom Bagby, G. Heigold +2

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

Yupeng Gao +3·2w ago

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Achieve spatially grounded natural language descriptions of urban development with PTNet, a new model that understands change semantics better than existing methods.

Yupeng Gao, Tianyu Li, Guoqing Wang +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Zicheng Zhao +3·2w ago

From Priors to Perception: Grounding Video-LLMs in Physical Reality

Video-LLMs aren't failing at perception, they're being tricked by their own assumptions, but a new dataset and reasoning chain can fix it.

Zicheng Zhao, Chaofan Gan, Shijie Li +1

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

AeroVironment·2w ago·also George Mason University

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

Existing restoration methods crumble when faced with the extreme geometric distortions caused by strong refractive warping, highlighting the need for robust new approaches benchmarked on this challenging dataset.

Maxim V. Shugaev, Md Reshad Ul Hoque, Bridget Kennedy +8

Computer Vision Eval Frameworks & Benchmarks

Wei Luo +34·2w ago

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Current video generation benchmarks overlook crucial aspects of physical plausibility and temporal coherence, highlighting the need for holistic evaluation metrics like PhyScore.

Wei Luo, Yiting Lu, Xin Li +32

Eval Frameworks & Benchmarks Multimodal Models World Models & Planning

May 5, 2026

Qiyao Wang +13·2w ago

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

LLMs struggle to navigate the complex, multi-turn justification and response dynamics of real-world patent examination, revealing critical gaps in legal reasoning and technical novelty judgment.

Qiyao Wang, Qiyao Wang, Xinyi Chen +11

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Jianjie Fang +10·2w ago

iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework

Current world models struggle with basic physical interaction tasks like distance perception and trajectory following, highlighting a critical gap in their ability to simulate realistic environments.

Jianjie Fang, Yingshan Lei, Qinglin Wan +8

Eval Frameworks & Benchmarks Robotics & Embodied AI World Models & Planning

Zirui Tang +19·2w ago

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Today's AI agents are surprisingly inept at navigating the messy reality of digital workspaces, failing to reach even 70% accuracy on tasks that require understanding file dependencies.

Zirui Tang, Xuanhe Zhou, Yumou Liu +17

Eval Frameworks & Benchmarks Tool Use & Agents

2w ago

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Forget resource-intensive pipelines: a purely academic team achieves SOTA search agent performance with just 10.6k SFT data points, outperforming models trained with CPT+SFT+RL.

Yuwen Du, Rui Ye, Shuo Tang +4

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

Joseph Breda +32·2w ago

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

LLMs beat doctors at everyday symptom diagnosis, but only when they proactively interview patients instead of passively answering questions.

Joseph Breda, Fadi Yousif, Beszel Hawkins +30

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Zhi Xu +1·2w ago·also Northeastern

NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise

LLMs struggle with causal reasoning when noise is introduced, but explicitly modeling causal graphs can dramatically improve performance and generalization.

Zhi Xu, Yun Fu

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Stefano Bannò +2·2w ago

Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs

LLMs are surprisingly good at pinpointing what's *wrong* with student writing, even outperforming human graders in identifying relative weaknesses.

Stefano Bannò, Kate Knill, Mark Gales

Eval Frameworks & Benchmarks Natural Language Processing

2w ago

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

Existing hallucination detection methods are missing subtle, word-level medical errors, but a new data-centric pipeline and detector closes the gap by 15%.

Tung Sum Thomas Kwok, Qian Qian, Xiaofeng Lin +8

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Stephen E. Moore +15·2w ago

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

Despite impressive multilingual capabilities, today's LLMs still can't reliably translate between English and Ghanaian languages at scale.

Stephen E. Moore, M. Owusu, Akwasi Asare +13

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Hoffmann Muki +1·2w ago

Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

LLMs exhibit a surprising "False Illegitimation bias," systematically misclassifying legitimate battles as violence against civilians, highlighting a critical flaw for conflict monitoring applications.

Hoffmann Muki, Olukunle P. Owolabi

Eval Frameworks & Benchmarks Natural Language Processing Open-Source Models & Weights

Humam Khan +4·2w ago

Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

LLMs may sound convincing when writing academic content, but they can still confidently fabricate facts and references at surprisingly high rates.

Humam Khan, Md. Tabrez Nafis, S. Sohail +2

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Elitsa Yotkova +4·2w ago

FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

Forget the heavy transformers: surprisingly effective LLM-generated code detection can be achieved with lightweight stylometric features and decision trees, offering near-instant inference.

Elitsa Yotkova, Violeta Kastreva, D. Dimitrov +2

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Natural Language Processing

D. Gringras +1·2w ago

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

LLM benchmarks are increasingly measuring the capabilities of yesterday's models, not today's frontier, creating a widening gap that misrepresents the state of AI.

D. Gringras, Misha Salahshoor

Eval Frameworks & Benchmarks Open-Source Models & Weights

Sebastian Wind +23·2w ago

Safety and accuracy follow different scaling laws in clinical large language models

Scaling clinical LLMs doesn't guarantee safety: high-risk errors persist even with advanced RAG and max-context prompting, highlighting the critical role of evidence quality and deployment strategy.

Sebastian Wind, Sebastian Wind, Tri-Thien Nguyen +21

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Scaling Laws & Emergent Abilities

Richard J. Young +1·2w ago·also DeepNeuro AI

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

LLMs can exhibit gender bias in emergency triage even when well-calibrated, and interventions effective for one model may backfire on another.

Richard J. Young, Alice M. Matthews

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

2w ago

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

LLMs' own self-judgments, when logically linked to their response features, can significantly improve hallucination detection.

Hao Mi, Qiang Sheng, Shaofei Wang +7

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

DAMO·2w ago

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.

Zhipeng Xu, Junhao Ji, Zulong Chen +10

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Mohamed F. Mady +2·2w ago

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

Naive application of transformer-based AI-text detectors can be brittle under distribution shift, but attention-based fusion of readability and vocabulary features can significantly improve robustness.

Mohamed F. Mady, Johannes Reschke, Björn W. Schuller

Eval Frameworks & Benchmarks Natural Language Processing Red-Teaming & Adversarial Robustness

Daniel Drucker +1·2w ago

The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

Language models can play the counterexample game, but their philosophical reasoning hits diminishing returns fast, and they're far more lenient judges than humans.

Daniel Drucker, Kyle Mahowald

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Lisa Adams +10·2w ago

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Clinicians trust AI recommendations nearly 3x more when those recommendations are broken down into verifiable facts linked to source guidelines, blowing traditional explainability out of the water.

Lisa Adams, Linus Marx, E. T. Orberg +8

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Scientific Discovery & Drug Design

2w ago

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

Even top LLM judges struggle to reliably detect violations of specific constraints in complex instructions, especially when violations are partial or absent, revealing critical blind spots in current evaluation methods.

Jaeyun Lee, Junyoung Koh, Z. Tok +2

Eval Frameworks & Benchmarks Natural Language Processing

Cherkasy State Business College · 2w ago

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

Separating LLMs into a deliberate validation layer, rather than making them an architectural default, can improve trustworthiness and efficiency in agentic AI systems.

Serhii W. Zabolotnii

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Tool Use & Agents

2w ago

Reproducing Complex Set-Compositional Information Retrieval

Neural retrievers, despite their success on standard benchmarks, fail spectacularly when forced to reason about set-theoretic constraints, revealing a reliance on spurious correlations rather than true compositional understanding.

Vincent Degenhart, Dewi Timman, Arjen P. de Vries +2

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Haesung Lee +7 · 2w ago

TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

LLMs in Korean judicial workflows are surprisingly prone to hallucination, bias, and inconsistency, especially when retrieving precedents and summarizing jurisprudence.

Haesung Lee, Gyubin Choi, Eun-Ju Lee +5

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

M. Arabov · 2w ago

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

Forget scaling laws: QLoRA-tuned Mistral 7B crushes other architectures for low-resource Tajik text generation, highlighting the importance of architecture choice in PEFT.

M. Arabov

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Mengchu Li +3 · 2w ago

Segmenting Human-LLM Co-authored Text via Change Point Detection

Pinpointing exactly where humans end and LLMs begin in co-authored text is now possible, thanks to a clever adaptation of time-series change point detection.

Mengchu Li, Jin Zhu, Jinglai Li +1

Eval Frameworks & Benchmarks Natural Language Processing

Akshay Syal +4 · 2w ago

A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.

Akshay Syal, L. Prince, E. Gultepe +2

Eval Frameworks & Benchmarks Multimodal Models Reasoning & Chain-of-Thought

J. Steinberg +1 · 2w ago

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Innocuous-looking coding tasks, when chained together, trick even the best coding agents into creating exploitable code with alarming frequency.

J. Steinberg, Oren Gal

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Haoyu Zhang +2 · 2w ago

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

LLM safety filters, which rely on semantic pattern matching, can be bypassed at scale by encoding harmful prompts as coherent mathematical problems, revealing a fundamental vulnerability.

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Red-Teaming & Adversarial Robustness

2w ago

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

Existing defenses crumble when LLM agents face prompt injections that adapt to dynamic context, but ARGUS offers a robust solution by tracking the provenance of agent decisions.

Shihao Weng, Yang Feng, Jinrui Zhang +3

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

2w ago

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agent skills are needlessly brittle and insecure: SkCC compiles them into a portable, hardened format that boosts performance by 50% and proactively blocks attacks.

Yipeng Ouyang, Yingjiao Xiao, Yuhao Gu +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Toufique Ahmed +3 · 2w ago

Reproduction Test Generation for Java SWE Issues

Java developers drowning in unfixed bugs, rejoice: automated reproduction test generation is now a viable option, thanks to a new benchmark and adapted generator.

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Shinas Shaji +3 · 2w ago

Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

LLMs spontaneously exhibit collaborative behaviors like perspective-taking and theory of mind in embodied settings, suggesting a surprising capacity for modeling human collaborators without explicit training.

Shinas Shaji, Teena Hassan, Sebastian Houben +1

Eval Frameworks & Benchmarks Robotics & Embodied AI Tool Use & Agents

2w ago

SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

Forget running the full gauntlet: just 4-5 workloads from SPEC CPU2026 can accurately mirror the entire suite, slashing evaluation costs without sacrificing fidelity.

Ruihao Li, A. Jacob, N. Yadwadkar +1

Distributed Systems & Hardware Eval Frameworks & Benchmarks

Daniel C. Elton +1 · 2w ago

Benchmarking open-source tools for in silico antiviral drug discovery

Public antiviral drug discovery datasets are riddled with errors that can be fixed with careful polyprotein splitting, unlocking significant performance gains in binding affinity prediction.

Daniel C. Elton, Preston W. Estep

Eval Frameworks & Benchmarks Open-Source Models & Weights Scientific Discovery & Drug Design

Busayo Awobade +2 · 2w ago

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Modern speech models struggle to generalize to noisy, domain-specific African speech, highlighting a critical gap for localized voice AI.

Busayo Awobade, Gabrial Zencha Ashungafac, Tobi Olatunji

Eval Frameworks & Benchmarks Natural Language Processing Speech & Audio

Jing Qiu +2 · 2w ago

SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation

RAG systems can now reduce unsafe answers by 37% using SURE-RAG, a transparent evidence verification method that outperforms even GPT-4o in controlled sufficiency tasks.

Jing Qiu, Zeyu Han, Chengen Huang

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Gehao Zhang +1 · 2w ago

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

LLMs can generate formally correct postconditions for code, but they often miss crucial details, especially in complex, real-world scenarios.

Gehao Zhang, Juan Zhai

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

UW · 2w ago

ProgramBench: Can Language Models Rebuild Programs From Scratch?

LLMs can't rebuild software from scratch, even for widely used programs like FFmpeg and SQLite, revealing a critical gap in their ability to make high-level software architecture decisions.

John Yang, K. Lieret, J. Ma +9

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

May 4, 2026

2w ago

AcademiClaw: When Students Set Challenges for AI Agents

Today's best AI agents can only solve 55% of real-world academic tasks that university students find challenging, revealing a significant gap between current AI capabilities and the demands of academic workflows.

Junjie Yu, Pengrui Lu, Weiye Si +75

Code Generation & Program Synthesis Eval Frameworks & Benchmarks Tool Use & Agents

Stanford HAI · 2w ago

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Current LLM agents are woefully inadequate for real-world clinical tasks, achieving only 46% success on a new benchmark that demands long-horizon reasoning and verifiable execution within electronic health records.

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler +10

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Georg-August-Universität Göttingen · 2w ago

A Treasure Trove of Performance: Analyzing the IO500 Submission Data

HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.

Julian Kunkel, Aasish Kumar Sharma, Anila Ghazanfar +2

Distributed Systems & Hardware Eval Frameworks & Benchmarks

Posts · 2w ago · also Telecommunications Institute of Technology

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.

Tung Vu, Yen Nguyen, Hai Nguyen +2

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Speech & Audio

ETH · 2w ago · also UZH

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Despite the promise of multimodal context, current audio-language models struggle to leverage clinical information for dysarthric speech recognition, even degrading performance in some cases.

Pehuén Moure, Niclas Pokel, Bilal Bounajma +4

Eval Frameworks & Benchmarks Multimodal Models Speech & Audio

2w ago

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Autonomous agents can produce plausible-sounding research that's subtly wrong, so ARIS uses adversarial collaboration between different LLMs to catch these errors.

Ruofeng Yang, Yongcan Li, Shuai Li

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents