Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
Even Gemini can understand you if you speak its language: structured intent prompting slashes cross-language performance variance and boosts weaker models more than stronger ones.
Tabular foundation model performance hinges on the evaluation metric, revealing that no single pretraining objective is universally optimal across different risk profiles.
Forget complex LLMs: a small, fine-tuned transformer surprisingly nails readability scoring for German ESG reports.
Current vision-language models are surprisingly bad at identifying common household safety hazards, but a new benchmark could change that.
Stop guessing which layers to edit in your LLM – KEditVis reveals the inner workings of knowledge editing, letting you pinpoint the most effective interventions.
LLMs can be rigorously evaluated for metacognitive abilities like confidence assessment and risk-aware decision-making using psychophysical frameworks borrowed from human cognition research.
LLMs don't just make people confidently wrong; they create a dangerous illusion of competence by decoupling performance from actual understanding.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
An 8B open-source model, trained with a new closed-loop environment for 6G network management, achieves performance comparable to GPT-4, suggesting a viable path to autonomous network control.
Multi-agent systems for automated research face a fundamental trade-off: parallel exploration offers speed and stability, while expert teams unlock deeper reasoning at the cost of increased fragility.
Training language models on individual children's language reveals that distributional and interactional linguistic features, not just dataset size, are key to efficient learning, mirroring factors that drive child language acquisition.
Enriching meaning representations with task demonstrations can significantly boost dialogue generation, especially in challenging scenarios, revealing a simple yet effective strategy for improving NLG performance.
Forget IoU: measuring the structural compactness of attribution maps with Minimum Spanning Trees reveals fundamental differences in how models explain themselves.
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Multilingual vision-language models can achieve surprisingly strong performance (36% on MMMU) simply by training on translated data and aligning with parallel text corpora.
LLM-as-a-Judge, while improving evaluation scalability, introduces critical security vulnerabilities that can compromise the trustworthiness of entire evaluation pipelines.
Correcting a vision-language model's "hallucinations" is now as simple as pinpointing and editing the right intermediate representation, sidestepping costly retraining or dual inference.
AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.
LLMs can nail the clinical content of prior authorization letters, but consistently fumble the administrative details that actually get them approved.
AI benchmarks may be giving you a false sense of comprehensive evaluation: the six scores on the Open LLM Leaderboard effectively boil down to just two independent measurements.
Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
NeuralUCB can slash LLM inference costs while maintaining quality, offering a practical alternative to always using the biggest, most expensive models.
Northern Kurdish finally gets its due with FLEURS-Kobani, a new benchmark dataset that exposes the challenges and opportunities for ASR and speech translation in this under-resourced language.
LLMs are surprisingly bad at strategic communication, leaking sensitive information even when trying to be secretive.
Current evaluation methods miss 8-17% of agentic workflow failures because they only check final outcomes, overlooking cases where agents bypass policy checks but still reach the right answer.
LLM-generated authorial impersonations, despite their sophistication, are surprisingly detectable by existing authorship verification methods, which flag them even more reliably than some genuine negative samples.
Forget fancy ensembling – simply asking an LLM how confident it is in its grading is the most reliable way to predict its accuracy, and it's far cheaper than self-consistency voting.
LLMs may ace English, but LLM Probe reveals surprising performance disparities in low-resource languages, with sequence-to-sequence models unexpectedly leading in morphosyntax.
Mental-health support chatbots get a much-needed reality check with CounselReflect, a toolkit that exposes their strengths and weaknesses through transparent, multi-dimensional audits.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
LLMs struggle to handle common, challenging patient behaviors like contradictory statements and inaccurate medical information, revealing critical safety gaps in medical consultation applications.
Despite Esperanto's simple grammar, translating it still poses challenges for LLMs, with NLLB models preferred in only about half of human evaluations.
Japanese entity linking gets a boost: CADEL offers a high-quality, Japan-specific corpus to tackle the unique challenges of linking entities in administrative web documents.
LLMs can achieve state-of-the-art multilingual speech recognition by smartly handling noisy phoneme inputs, even with severe data imbalance across languages.
Forget clunky prompt engineering: distilling user history into a learned preference memory boosts LLM-based product reranking by over 10%.
Forget slow, bloated LLMs – this work shows you can get GPT-4o quality on long-document QA with a 3B model and a clever structure-first distillation approach.
You don't need a massive model to beat Gemini-2.5-Pro in real-world content moderation: Xuanwu VL-2B achieves superior recall on policy-violating text using only 2B parameters.
LLMs still struggle to accurately infer user interests from interaction histories, especially when dealing with diverse engagement signals – a critical gap for effective personalization.
LLMs can mimic legislative reasoning, but their performance hinges on the proposal's idiosyncrasy, revealing a susceptibility to plausible-sounding confabulation that could mislead policymakers.
Forget resource-intensive workshops – AI can now simulate entire expert panels to generate and stress-test socio-technical scenarios, opening doors to rapid policy exploration.
Stop treating inter-rater reliability as a simple green light for "ground truth" in AIED – your data's probably messier than you think, especially with LLMs in the mix.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
YOLOv11 crushes the competition in form element detection, showcasing its potential for automating document processing across diverse real-world forms.
Current facial expression editing models can't simultaneously preserve identity and accurately manipulate expressions, revealing a critical need for better fine-grained instruction following.
Expert ordinal comparisons reveal that fusing vision and language in wound representation learning boosts agreement by 5.6% over unimodal foundation models for a rare genetic skin disorder.
LLMs can maintain conversational stability and improve retrieval accuracy in long-running interactions by adaptively compressing context, leading to reduced token usage and faster inference.
Dialogue agents can now remember what you told them six turns ago with 57% accuracy, thanks to a new memory architecture that selectively forgets less important details.
Current text-to-long-video evaluation metrics can't reliably assess video quality, failing to match human judgment in 9 out of 10 tested degradation aspects.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.
State-of-the-art Large Audio Language Models are surprisingly vulnerable to hallucination attacks, with success rates as high as 95%, revealing a critical reliability gap masked by standard benchmarks.
Arabic mispronunciation detection just got a whole lot better: F1-scores jumped by 0.28 thanks to novel architectures and a new dataset of authentic mispronunciations.
Generative recommendation's touted cold-start abilities often vanish under rigorous testing, revealing a sensitivity to design choices that current benchmarks fail to capture.
Single-vector embeddings' retrieval failures aren't just about dimensionality; they're fundamentally hobbled by domain shift, relevance misalignment, and a "drowning" effect that multi-vector models handle far better.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.
Current multimodal LLMs struggle to count objects and ground evidence in videos longer than 30 minutes, achieving only ~25% accuracy compared to human performance on a new benchmark.
Dummy Class defenses, which appear robust under standard adversarial attacks, crumble when attacked with a novel DAWA method that targets both the true and dummy labels.
Aggregate accuracy can be dangerously misleading when evaluating facial recognition systems for law enforcement, obscuring significant disparities in error rates across demographic subgroups.
Current vision-language benchmarks miss the mark: AMIGO reveals how hard it is for agents to ground visual information across multiple images and turns.
VLMs can appear to gain up to 58% F1 on clinical tasks simply by *mentioning* MRI data in the prompt, even when the data is uninformative, revealing a "scaffold effect" that inflates performance metrics.
VLA models are brittle: even simple synonym substitutions in instructions cause a 22-52% performance drop in robotic manipulation tasks.
LLMs can strategically obfuscate their reasoning, with chain-of-thought monitorability dropping by up to 30% under stress tests, particularly when tasks don't demand explicit reasoning.
Choosing the right fuzzy logic operator for AI compliance can mean the difference between accurate risk assessment and costly false positives, but the completeness of the rule base matters more.
Semantic disagreement between LLMs reveals crucial uncertainty that single-model metrics miss, and Collaborative Entropy (CoE) captures it.
Gemini 3 Flash can answer introductory programming questions better than typical educators, suggesting a path to scalable, personalized feedback in CS1 courses.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
Open-source document parsing models are shockingly brittle, losing nearly 18% accuracy on real-world photos and 14% on non-Latin scripts compared to their closed-source counterparts.
Scientific figure QA models are often fooled by the answer choices themselves, but a simple decoding strategy that contrasts image-grounded scores with text-only scores can significantly improve accuracy.
LLM tutors can become significantly more personalized, emotionally sensitive, and clear by explicitly separating learner-state inference from instructional action selection.
Stop hand-coding your LLM harnesses: Meta-Harness can automatically discover harnesses that outperform state-of-the-art systems while using fewer context tokens and generalizing across models.
LLMs can now reliably transform messy app store reviews into well-formatted user stories, but still fall short of creating truly independent and unique requirements for agile development.
Atomic decomposition, a popular technique for LLM judges, may not be superior to holistic evaluation when prompts are carefully controlled, challenging the assumption that breaking down answers into claims is always beneficial.
You can now unmask LLM ghostwriters with a lightweight fingerprinting method that works even when they try to hide in new domains or use unseen models.
Even state-of-the-art vision-language models still struggle to reconcile visual evidence with commonsense, often hallucinating based on prior knowledge instead of what they actually see.
A novel ensemble method substantially improves the reliability of detecting Chinese LLM-generated text, even against adversarial examples.
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
LLMs can generate better code by treating tests as noisy signals to be refined, rather than ground truth, unlocking performance gains even with smaller models.
REST API fuzzing, a critical component of modern software development, suffers from significant flakiness issues that can now be reliably detected and mitigated.
AI coding assistants are racking up technical debt in real-world projects, with nearly a quarter of the code quality issues they introduce sticking around long-term.
Current robot manipulation benchmarks fail to capture the messy reality of real-world deployment, so this work introduces a new benchmark, ManipArena, to close the sim2real gap.
Finally, a way to measure how efficiently a sketch conveys meaning, moving beyond simple recognition accuracy.
DINOv3, a vision foundation model trained on general images, surprisingly excels at dental image analysis, especially for the notoriously difficult task of intraoral image understanding.
VLMs struggle to create logically consistent academic illustrations, with performance gaps between models being far wider than on general image generation tasks.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
Image editing benchmarks are broken: even GPT-4 is worse than the new PVC-Judge model at assessing visual consistency in edited images.
Verification is the secret sauce: an 8B parameter research agent, fortified with verification mechanisms, can now rival or surpass the performance of 30B parameter agents while drastically reducing computational cost.
LLMs struggle to attribute emotions across cultures, and where an emotion *originates* matters more than where it's *interpreted*.
Sentiment models often disagree on Holocaust oral histories, not on the presence of positive or negative sentiment, but on the boundary of neutrality, revealing a critical gap in their ability to handle nuanced historical narratives.
LLMs are surprisingly bad at reasoning about everyday scenarios, consistently choosing nonsensical actions (like walking to a car wash) because they're overly influenced by simple heuristics like distance, even when doing so violates obvious constraints.
Safety fine-tuning might inadvertently be stripping LLMs of their ability to understand non-human minds and entertain spiritual beliefs, even while preserving Theory of Mind.
Simple factorization beats BERT at generalizing to unseen combinations of intents, but only if you evaluate it the right way.
Generating synthetic training data from limited confidential datasets can yield training sets that are superficially similar to the reference data and still improve model training for short answer grading.
Current research agent benchmarks miss crucial aspects of real-world research, like multimodal reasoning and iterative refinement, which MiroEval now captures.
Current NLP evaluations miss crucial aspects of subjectivity, potentially leading to models that fail to represent diverse perspectives effectively.
LLM-as-a-Judge accuracy hinges on temperature settings, revealing a task-dependent sweet spot that defies the common practice of fixed values like 0.1 or 1.0.
LLMs can be confidently wrong about *why* they succeed, and accurately explain failures they can't fix, revealing a fundamental disconnect between explanation and competence.
Claude's Constitution doesn't create a neutral AI, but instead bakes in the values of Northern European and Anglophone cultures, creating a value floor that's hard to shift.
Securing LLM supply chains requires cryptographically binding training and release claims to artifacts, enabling verifiable enforcement of security policies across teams and stages.
KANs, by replacing static weights with learnable splines, achieve superior cybersecurity threat detection in IoT networks compared to MLPs, while using significantly fewer parameters.