Eval Frameworks & Benchmarks
Evaluation methodology for AI systems, benchmark design, capability measurement, and safety evaluations.
Recent Papers
The paper introduces the Visual Reasoning Benchmark (VRB), a new dataset of 701 visual reasoning questions sourced from primary school exams in Zambia and India, designed to evaluate multimodal large language models (MLLMs). The VRB focuses on minimal-text images to simulate realistic classroom visual reasoning problems, covering tasks like analogy, pattern completion, and spatial matching. Experiments using the VRB reveal that MLLMs exhibit a "jagged frontier" of capabilities, performing well on static tasks like counting but struggling with dynamic spatial operations like folding and rotation.
Introduces the Visual Reasoning Benchmark (VRB), a novel dataset of classroom-authentic visual reasoning problems, to evaluate the spatial reasoning capabilities of MLLMs.
The paper introduces SAGEO Arena, a realistic evaluation environment for Search-Augmented Generative Engine Optimization (SAGEO) that addresses limitations of existing benchmarks by incorporating a full generative search pipeline over a large-scale corpus of web documents with rich structural information. They demonstrate that existing optimization approaches are often impractical and degrade performance in retrieval and reranking stages under realistic conditions. The study highlights the importance of structural information and stage-specific optimization for effective SAGEO.
Introduces SAGEO Arena, a novel benchmark environment enabling realistic, stage-level evaluation of search-augmented generative engine optimization strategies.
This paper introduces TopoFair, a benchmarking framework for fair link prediction that focuses on the impact of diverse topological biases beyond homophily. They formalize a taxonomy of topological bias measures and develop a graph generation method that allows for controlled variation of these biases while maintaining real-world graph characteristics. Through empirical evaluation of link prediction models, including fairness-aware methods, they demonstrate the sensitivity of fairness interventions to these structural biases.
Introduces a novel benchmarking framework, TopoFair, to analyze the interplay between topological biases and fairness in link prediction.
This paper introduces CSEval, a framework for evaluating the clinical semantic alignment between text prompts and generated medical images, addressing the limitations of existing metrics focused on realism and diversity. CSEval uses language models to identify semantic inconsistencies related to anatomical location and pathology, demonstrating a correlation with expert clinical judgment. The framework offers a scalable method for assessing the clinical reliability of generated medical images, crucial for the safe deployment of text-to-image models in healthcare.
Introduces CSEval, a novel language model-based framework, to evaluate the clinical semantic alignment between text prompts and generated medical images.
The paper investigates test-time scaling strategies for web agents in multi-step tasks, finding that uniform scaling saturates quickly and LLM-based arbiters can overrule high-consensus decisions. They demonstrate that uncertainty statistics from the agent's vote distribution correlate with task success, enabling dynamic compute allocation. Based on these findings, they introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are contentious, improving performance and efficiency.
Introduces Confidence-Aware Test-Time Scaling (CATTS), a novel method for dynamically allocating compute to web agents based on vote-derived uncertainty, achieving improved performance and efficiency compared to uniform scaling.
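A minimal sketch of the confidence-aware idea, assuming the vote-then-scale setup described above: sample a few candidate actions, measure how contested the vote is, and spend extra samples only when the margin is small. The `sample_action` callable and the thresholds are hypothetical placeholders, not the paper's interface.

```python
from collections import Counter

def vote_margin(votes):
    """Fraction margin between the top two vote shares (1.0 = unanimous)."""
    counts = Counter(votes).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    return (top - second) / len(votes)

def confidence_aware_decision(sample_action, n_base=4, n_extra=12, margin_threshold=0.5):
    """Sample a small initial vote; spend extra compute only when the vote is contentious.

    `sample_action` is a hypothetical callable that runs the agent once and
    returns its proposed action for the current step.
    """
    votes = [sample_action() for _ in range(n_base)]
    if vote_margin(votes) < margin_threshold:  # contentious: allocate more samples
        votes += [sample_action() for _ in range(n_extra)]
    return Counter(votes).most_common(1)[0][0]
```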
The paper introduces RouterXBench, a comprehensive evaluation framework for LLM routers, addressing limitations of existing benchmarks by considering router ability, scenario alignment, and cross-domain robustness. They propose ProbeDirichlet, a novel router that leverages internal hidden states and learnable Dirichlet distributions for probabilistic training, capturing model uncertainty more effectively than methods relying on output probabilities or external embeddings. Empirical results demonstrate that ProbeDirichlet outperforms existing routers, achieving significant improvements in router ability and high-accuracy scenarios, while exhibiting robust generalization across diverse model families, scales, tasks, and workflows.
Introduces ProbeDirichlet, a router that aggregates cross-layer hidden states via learnable Dirichlet distributions for improved uncertainty estimation and routing decisions.
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces DeepSight, an open-source toolkit designed to integrate safety evaluation and diagnosis for large language models (LLMs) and multimodal large language models (MLLMs). DeepSight combines DeepSafe, an evaluation toolkit, and DeepScan, a diagnosis toolkit, to provide a more comprehensive safety workflow. By unifying task and data protocols, DeepSight aims to bridge the gap between black-box risk evaluation and white-box mechanistic understanding, facilitating targeted safety alignment.
Introduces DeepSight, the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis by unifying task and data protocols.
The paper introduces IncompeBench, a new benchmark for Music Information Retrieval (MIR) consisting of 1,574 permissively licensed music snippets, 500 diverse queries, and over 125,000 relevance judgements. This benchmark addresses the lack of high-quality evaluation datasets in MIR, enabling more rigorous and reproducible research. High inter-annotator agreement was achieved through a multi-stage annotation pipeline, ensuring data quality.
Provides IncompeBench, a permissively licensed, fine-grained benchmark dataset to facilitate advancements in music information retrieval.
The paper introduces WavBench, a new benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquialism, and paralinguistics, addressing limitations of existing text-centric benchmarks. WavBench comprises three subsets: Pro (reasoning), Basic (colloquialism), and Acoustic (paralinguistics), designed to assess complex problem-solving, natural language fluency, and nuanced understanding/generation of acoustic cues. Evaluation of five state-of-the-art models using WavBench reveals critical insights into model performance across these dimensions, highlighting areas for improvement in building more robust spoken dialogue agents.
Introduces WavBench, a novel benchmark dataset and evaluation toolkit designed to comprehensively assess reasoning, colloquialism, and paralinguistic capabilities in end-to-end spoken dialogue models.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
This paper investigates whether the sentences in existing semantically deviant datasets are genuinely nonsensical by comparing human and LLM judgments, both with and without provided contexts. The study reveals that humans generally perceive these sentences as anomalous rather than nonsensical, suggesting existing datasets may not be as nonsensical as assumed. Furthermore, the research demonstrates LLMs' ability to generate plausible contexts that render anomalous sentences more sensible.
Empirically demonstrates that existing "nonsensical" datasets are largely composed of anomalous sentences interpretable with context, and that LLMs can generate such contexts.
The paper introduces DynaHOI-Gym, a new online closed-loop platform for benchmarking hand motion generation in dynamic hand-object interaction (HOI) scenarios, addressing the limitations of existing benchmarks focused on static objects. To facilitate research, the authors release DynaHOI-10M, a large-scale dataset comprising 10 million frames and 180K hand capture trajectories with diverse target motions. They also present an observe-before-act (ObAct) baseline that leverages spatiotemporal attention, demonstrating improved location success rates in the dynamic HOI setting.
Introduces DynaHOI-Gym and DynaHOI-10M, a novel benchmark and dataset for evaluating hand motion generation in dynamic hand-object interaction scenarios.
The paper investigates the phenomenon of "benchmark illusion," where LLMs with similar benchmark accuracy exhibit significant disagreement on individual data points. Using MMLU-Pro and GPQA benchmarks, the authors quantify the disagreement rates between various LLMs, including top-performing frontier models. They demonstrate that this disagreement can lead to substantial variability in scientific research outcomes when LLMs are used for data annotation and inference, impacting the reproducibility of results.
Demonstrates that seemingly convergent benchmark accuracy among LLMs masks substantial disagreement on individual data points, leading to significant consequences for scientific reproducibility.
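A toy illustration of the underlying point, using made-up predictions rather than anything from the paper: two models can share the same benchmark accuracy while answering a large fraction of individual items differently.

```python
def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Toy illustration: identical benchmark accuracy, yet 50% item-level disagreement.
gold    = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_1 = ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C"]  # 7/10 correct
model_2 = ["A", "B", "C", "D", "A", "C", "D", "D", "A", "D"]  # 7/10 correct
assert accuracy(model_1, gold) == accuracy(model_2, gold) == 0.7
print(disagreement_rate(model_1, model_2))  # 0.5
```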
The paper introduces TIME, a new time series forecasting benchmark designed to address limitations in existing benchmarks related to data composition, integrity, task formulation, and analysis perspectives. TIME comprises 50 fresh datasets and 98 forecasting tasks constructed using a human-in-the-loop pipeline to ensure data integrity and real-world alignment. The benchmark also introduces a pattern-level evaluation perspective based on structural time series features to provide generalizable insights into model capabilities, and the authors evaluate 12 time series foundation models (TSFMs) on TIME.
Introduces TIME, a novel task-centric time series forecasting benchmark with enhanced data integrity, real-world task formulations, and pattern-level evaluation.
This paper investigates GPT-5's ability to learn Idris, a functional programming language, through iterative prompting strategies. The authors found that zero-shot performance on Idris programming exercises was significantly lower than performance on Python and Erlang. By incorporating local compilation errors into the prompts, the authors achieved a substantial performance increase, solving 54 out of 56 problems.
Demonstrates that compiler-guided, error-driven iterative prompting significantly improves GPT-5's performance in a low-resource programming language.
The paper investigates modality arbitration in Audio-LLMs, revealing a strong bias towards text over audio when the two modalities conflict, even when audio quality is superior. Using the ALME benchmark, the authors demonstrate that Gemini 2.0 Flash exhibits significantly higher text dominance in audio-text conflicts compared to text-text conflicts. They propose that this text dominance arises from an asymmetry in arbitration accessibility rather than information content, and provide evidence through interventions like forced transcription and fine-tuning ablations.
Reveals and analyzes a significant text dominance bias in audio-LLMs during modality arbitration, attributing it to differences in representational accessibility rather than information content.
This paper introduces RooflineBench, a benchmarking framework for on-device LLMs based on the Roofline model, using operational intensity (OI) to unify architectural primitives and hardware constraints. They define an inference-potential region and introduce Relative Inference Potential to compare LLM efficiency on the same hardware. Empirical analysis reveals that sequence length significantly influences performance and OI, identifies OI regression with model depth, and demonstrates how structural refinements like M-LA can unlock inference potential.
Introduces RooflineBench, a novel benchmarking framework leveraging Roofline analysis and operational intensity to evaluate and optimize on-device LLM performance across diverse hardware platforms.
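For readers unfamiliar with the underlying Roofline model, a minimal sketch of the standard bound it builds on: attainable throughput is the lesser of the hardware's peak compute and operational intensity times memory bandwidth. The hardware numbers and the GEMV example are hypothetical, and this is not the paper's RooflineBench API.

```python
def operational_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline_attainable_gflops(oi, peak_gflops, peak_gbps):
    """Classic Roofline bound: compute-bound above the ridge point, memory-bound below."""
    return min(peak_gflops, oi * peak_gbps)

# Example: a decode-phase GEMV with low OI on hypothetical mobile hardware.
oi = operational_intensity(flops=2 * 4096 * 4096, bytes_moved=4096 * 4096 * 2)  # ~1 FLOP/byte
print(roofline_attainable_gflops(oi, peak_gflops=2000, peak_gbps=100))  # memory-bound: ~100 GFLOPS
```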
The paper investigates the failure of speech recognition models on transcribing U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately larger routing distance errors for non-English primary speakers. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street name transcription accuracy by nearly 60% for non-English primary speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
This paper introduces Microarchitecture Cliffs, a benchmark generation methodology to identify and attribute microarchitectural mismatches between architectural simulators and RTL implementations for model calibration. The Cliff methodology generates benchmarks that isolate individual microarchitectural features, enabling precise attribution of behavioral differences. Applying this methodology to calibrate XS-GEM5 against XS-RTL, the authors reduced performance error on Cliff benchmarks from 59.2% to 1.4% and improved performance prediction accuracy on SPEC2017 benchmarks.
Introduces a novel benchmark generation methodology, Microarchitecture Cliffs, for isolating and attributing microarchitectural discrepancies between simulators and RTL implementations, significantly improving simulator calibration accuracy.
The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) fine-tuned reasoning models, where standard likelihood-based detection methods are ineffective. They observe that RLVR training leads to a structural convergence in the model's generations for seen prompts, resulting in more rigid and similar outputs compared to unseen prompts. They introduce Min-$k$NN Distance, a black-box detector that leverages this convergence by measuring the average of the $k$ smallest nearest-neighbor edit distances between multiple completions of a given prompt.
Introduces Min-$k$NN Distance, a novel black-box detector, to identify RLVR training data by quantifying the structural convergence of reasoning trajectories induced by RLVR.
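One plausible reading of the detector, sketched under the assumption that "the $k$ smallest nearest-neighbor edit distances" means: compute each completion's edit distance to its nearest neighbor among the other completions, then average the $k$ smallest of those values. The paper's exact definition may differ.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def min_knn_distance(completions, k=3):
    """Average of the k smallest nearest-neighbor edit distances among completions.

    Low values indicate structurally convergent generations, which the paper
    associates with prompts seen during RLVR training.
    """
    nn_dists = []
    for i, c in enumerate(completions):
        others = [edit_distance(c, o) for j, o in enumerate(completions) if j != i]
        nn_dists.append(min(others))
    k = min(k, len(nn_dists))
    return sum(sorted(nn_dists)[:k]) / k
```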
The paper introduces MuRGAt, a new benchmark for evaluating fact-level multimodal attribution in complex reasoning scenarios involving video, audio, and other modalities. MuRGAt requires models to generate answers with explicit reasoning and precise citations that specify modality and temporal segments. The authors also present an automatic evaluation framework that correlates with human judgments, revealing that current MLLMs often hallucinate citations even with correct reasoning, and that increasing reasoning depth can degrade attribution accuracy.
Introduces MuRGAt, a challenging benchmark and automatic evaluation framework for fact-level multimodal attribution that exposes limitations in current MLLMs' ability to ground reasoning in heterogeneous input sources.
The paper investigates how to best pretrain small language models (SLMs) to decide which tokens to predict directly versus delegating to an external source via a special token. They find that loss alone is insufficient for determining optimal delegation, as some high-loss tokens represent acceptable alternative continuations. They introduce LaCy, a pretraining method that uses a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate and resulting in improved FactScore in cascaded generation setups compared to other methods.
Introduces LaCy, a pretraining method that leverages a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate token prediction to an external source.
This paper investigates the ability of Large Language Models (LLMs) to adapt to language variations across different socioeconomic status (SES) communities by comparing LLM-generated text completions with original text from a novel Reddit and YouTube dataset stratified by SES. The study analyzes 94 sociolinguistic features to assess the degree of stylistic adaptation exhibited by four LLMs. Results indicate that LLMs show limited stylistic modulation with respect to SES, often producing approximations or caricatures, and demonstrate a bias towards emulating upper SES styles, highlighting the risk of amplifying linguistic hierarchies.
Reveals that LLMs exhibit limited stylistic adaptation across socioeconomic strata and tend to favor upper SES linguistic styles, raising concerns about perpetuating linguistic biases.
This paper investigates the impact of underspecified questions on QA performance, finding that a significant portion of questions in standard QA benchmarks are underspecified. They introduce an LLM-based classifier to identify these questions and demonstrate that LLMs perform worse on them. Through a controlled rewriting experiment, they show that rewriting underspecified questions into fully specified variants, while keeping the gold answers fixed, consistently improves QA performance.
Demonstrates that question underspecification is a significant confound in QA evaluation by showing that rewriting underspecified questions improves QA performance.
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Introduces a novel object-aligned visual contrastive decoding method that masks salient visual features to mitigate object hallucinations in MLLMs.
The authors introduce ADRD-Bench, a new benchmark dataset for evaluating LLMs on Alzheimer's Disease and Related Dementias (ADRD), comprising a unified QA set from existing medical benchmarks and a novel QA set derived from the Aging Brain Care (ABC) program. They aim to address the lack of ADRD-specific evaluation resources and practical caregiving context in existing benchmarks. Evaluating 33 state-of-the-art LLMs, they found that while some models achieve high accuracy, inconsistencies in reasoning quality and stability remain a significant limitation.
Introduces ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs, incorporating both unified clinical knowledge and practical caregiving questions.
This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. They propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. The authors trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.
Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.
This paper introduces a French-focused benchmark for PDF-to-Markdown conversion using VLMs, addressing the lack of evaluation datasets for non-English documents and the over-penalization of formatting variations in existing benchmarks. The benchmark consists of challenging French documents selected via model-disagreement sampling and is evaluated using unit-test-style checks targeting specific failure modes like text presence and reading order, combined with category-specific normalization. Results across 15 models show that proprietary models exhibit higher robustness on handwriting and forms, while open-weight models are competitive on standard layouts.
Introduces a new French-language PDF-to-Markdown benchmark with targeted unit tests and category-specific normalization to more accurately assess VLM performance in RAG pipelines.
The paper introduces PatientHub, a unified framework to standardize the creation, composition, and deployment of simulated patients for training counselors and scaling therapeutic assessment using Large Language Models. PatientHub addresses the fragmentation in existing patient simulation approaches by providing standardized data formats, prompts, and evaluation metrics, thus improving reproducibility and enabling fair comparisons. The authors demonstrate PatientHub's utility through case studies, showcasing standardized cross-method evaluation, seamless integration of custom evaluation metrics, and the prototyping of new simulator variants.
Introduces PatientHub, a modular framework that unifies patient simulation by standardizing data formats, prompts, and evaluation metrics to facilitate reproducibility and fair comparison of different methods.
The paper introduces V-SHiNE, a browser-based virtual smart home environment designed to facilitate the evaluation of explainable AI (XAI) methods in the context of smart home automation. V-SHiNE enables researchers to configure realistic smart home environments, simulate user behaviors, integrate custom explanation engines, and log user interactions. A user study with 159 participants demonstrates the framework's utility for assessing the impact and quality of different explanation strategies.
Introduces V-SHiNE, a novel browser-based simulation framework, to enable scalable and reproducible evaluation of XAI methods within virtual smart home environments.
This paper investigates the overlap between code review comments generated by human reviewers and those produced by ChatGPT-4, focusing on the types of quality improvements recommended. The authors manually classified 739 human-generated comments from 240 pull requests and compared them to ChatGPT-4's recommendations on the same PRs. Results indicate that while ChatGPT-4 suggests more changes overall, it only identifies 10% of the issues flagged by humans, though 40% of ChatGPT-4's additional suggestions are valuable, highlighting the complementary nature of both approaches.
Quantifies the overlap and differences in quality improvement recommendations between human code reviewers and ChatGPT-4, revealing the strengths and weaknesses of each approach.
The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like τ-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.
This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.
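A minimal sketch of the checklist-reward idea as described above: each turn is scored by the fraction of grounded binary criteria it satisfies, giving a denser, classification-style signal than a single verifiable outcome. The criterion names are hypothetical.

```python
def checklist_reward(criteria_results):
    """Classification-style reward: fraction of per-turn binary criteria satisfied.

    `criteria_results` maps each fine-grained criterion (e.g. "called the right
    tool", "passed the user-provided date") to a grounded True/False judgment.
    """
    if not criteria_results:
        return 0.0
    return sum(criteria_results.values()) / len(criteria_results)

print(checklist_reward({"correct_tool": True, "correct_args": True, "cites_evidence": False}))  # ~0.67
```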
The authors introduce ExtractBench, a new benchmark and evaluation framework for end-to-end PDF-to-JSON structured extraction, designed to address the lack of comprehensive benchmarks and principled evaluation methodologies for complex, nested extraction tasks. ExtractBench comprises 35 PDF documents paired with JSON Schemas and human-annotated gold labels across diverse domains, resulting in 12,867 evaluatable fields with varying schema complexities. Evaluations using ExtractBench reveal that state-of-the-art LLMs struggle with realistic schemas, particularly as schema breadth increases, with some models achieving 0% valid output on a 369-field schema.
Introduces ExtractBench, a novel benchmark and evaluation framework, to address the limitations of existing methods in evaluating complex structured extraction from PDFs using LLMs.
The paper introduces PRIME, a new benchmark designed to evaluate verifiers for process-outcome alignment in mathematical and engineering problem-solving, addressing the limitations of outcome-centric verification methods in Reinforcement Learning with Verifiable Rewards (RLVR). PRIME consists of 2,530 high-difficulty STEM problems and is used to demonstrate that existing verifiers often fail to identify flaws in the derivation process. The authors show that RLVR training using verifiers selected based on PRIME significantly improves performance on challenging math problem sets, and that PRIME's accuracy strongly correlates with RLVR training effectiveness.
Introduces PRIME, a novel benchmark for evaluating the ability of verifiers to align the reasoning process with the final outcome in complex STEM problems.
This paper introduces USE24-XD, a dataset of approximately 100,000 social media posts from X related to the 2024 U.S. presidential election, categorized into five harmful content types using a "wisdom of the crowd" approach with six LLMs. The study validates LLM annotations against human crowdsourcing, finding comparable agreement and high recall for specific categories like Speculation. Analysis of human annotator demographics reveals systematic biases in labeling harmful content, underscoring the subjectivity inherent in such judgments.
Introduces USE24-XD, a large-scale, multi-labeled dataset of election-related social media content annotated by LLMs and validated by human annotators, to facilitate research on harmful online narratives.
This paper investigates whether GPT-4o possesses a genuine Theory of Mind (ToM) by evaluating its ability to model the causal relationship between mental states and behavior. The authors developed a novel evaluation framework based on a cognitively-grounded definition of ToM, probing for coherence, domain-generality, and consistency in the model's understanding of mental state causality. The key finding is that while GPT-4o can approximate human judgments in simple ToM tasks, it fails on logically equivalent tasks and demonstrates low consistency between predicted actions and inferred mental states, suggesting a lack of a robust ToM.
Demonstrates that GPT-4o, despite apparent social proficiency, lacks a coherent, domain-general, and consistent Theory of Mind by revealing inconsistencies in its mental state inferences and action predictions.
This paper introduces a semi-automated pipeline for extracting Subject-Predicate-Object triplets from financial reports using LLMs, addressing the lack of ground truth data by employing ontology-driven proxy metrics like Ontology Conformance and Faithfulness. The authors compare a manually engineered ontology with a document-specific, automatically induced ontology, finding that the latter achieves 100% schema conformance and eliminates ontology drift. They also propose a hybrid verification strategy combining regex matching and LLM-as-a-judge to reduce subject hallucination rates, and identify asymmetries in subject/object hallucinations.
Introduces a semi-automated pipeline for LLM-based triplet extraction from financial reports evaluated using ontology-driven proxy metrics, circumventing the need for annotated ground truth.
This paper analyzes predictive multiplicity, the phenomenon of multiple AI models with similar overall accuracy disagreeing on individual predictions, in the context of the EU AI Act. It argues that high predictive multiplicity violates the Act's requirements for individual-level performance reporting, as it introduces arbitrariness in decisions impacting humans. The paper proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify disagreement between models on individual cases and offers practical guidelines for model providers to evaluate and report predictive multiplicity.
Proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify predictive multiplicity and facilitate compliance with the EU AI Act's accuracy provisions.
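One plausible formalization of an individual conflict ratio, assuming it counts how many pairs of comparably accurate models disagree on a single case; the paper's precise definitions of the ratio and of $\delta$-ambiguity may differ.

```python
from itertools import combinations

def individual_conflict_ratio(predictions):
    """Fraction of model pairs that disagree on a single case.

    `predictions` holds one prediction per model for the same individual.
    """
    pairs = list(combinations(predictions, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# Five models with similar overall accuracy can still split 3-2 on one person.
print(individual_conflict_ratio(["approve", "approve", "approve", "deny", "deny"]))  # 0.6
```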
The paper introduces CLUES, a framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores to differentiate between input ambiguity requiring clarification and model instability requiring human review. CLUES models Text-to-SQL as a two-stage process of interpretations to answers and computes instability using the Schur complement of a bipartite semantic graph matrix. Experiments on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark demonstrate that CLUES improves failure prediction compared to Kernel Language Entropy and provides diagnostic decomposition for targeted interventions.
Introduces CLUES, a novel framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores, enabling targeted interventions for query refinement and model improvement.
The paper introduces Selective Abstraction (SA), a framework for improving the reliability of long-form text generation by allowing LLMs to selectively reduce the specificity of uncertain content instead of abstaining entirely. They formalize SA using selective risk and coverage metrics and propose Atom-wise Selective Abstraction, which decomposes responses into atomic claims and replaces uncertain claims with more general abstractions. Empirical evaluation on FactScore and LongFact-Objects benchmarks demonstrates that Atom-wise SA significantly improves the risk-coverage trade-off compared to claim removal, boosting AURC by up to 27.73% across six open-source models.
Introduces Selective Abstraction, a novel framework enabling LLMs to trade specificity for reliability in long-form generation by selectively abstracting uncertain content.
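A minimal sketch of selective risk and coverage adapted to atomic claims, under the assumption that abstracted claims drop out of the risk computation while claims kept at full specificity count toward both metrics; the paper's formalization may differ in detail.

```python
def selective_risk_and_coverage(claims):
    """Selective-prediction style metrics over atomic claims.

    Each claim is a (kept_specific, is_correct) pair: claims judged uncertain
    are abstracted (kept_specific=False) and drop out of the risk computation.
    """
    kept = [c for c in claims if c[0]]
    coverage = len(kept) / len(claims)
    risk = sum(not correct for _, correct in kept) / len(kept) if kept else 0.0
    return risk, coverage

# Abstracting the two least certain claims trades coverage for lower risk.
claims = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(selective_risk_and_coverage(claims))  # (0.333..., 0.6)
```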
The paper introduces WebTestPilot, an LLM-based agent for end-to-end web testing against natural language specifications that addresses the challenges of implicit oracle inference and probabilistic reasoning. WebTestPilot uses a symbolization layer to represent GUI elements as symbols and translates natural language into step-by-step instructions with inferred pre- and post-conditions over these symbols, effectively capturing data, temporal, and causal dependencies for validation. Experiments on a new benchmark of bug-injected web applications demonstrate that WebTestPilot achieves a 99% task completion rate with 96% precision and 96% recall in bug detection, significantly outperforming existing LLM-based approaches.
Introduces a novel approach to end-to-end web testing by inferring oracles with symbolized GUI elements, enabling the agent to validate implicit requirements and improve bug detection accuracy.
The paper investigates whether neural world models truly learn physical laws or rely on statistical shortcuts, particularly under out-of-distribution shifts. They introduce PhyIP, a non-invasive evaluation protocol that assesses the linear decodability of physical quantities from frozen latent representations, contrasting it with adaptation-based methods. Their results show that when self-supervised learning achieves low error, latent physical structures are linearly accessible and robust to OOD shifts, while adaptation-based evaluations can collapse this structure, suggesting that non-invasive probes are more accurate for evaluating physical world models.
Introduces PhyIP, a non-invasive evaluation protocol, to accurately assess the linear accessibility of physical quantities in frozen latent representations of world models, demonstrating its superiority over adaptation-based methods.
The paper introduces Distribution Map (DMAP), a novel method for representing text using next-token probability distributions from LLMs by mapping text to samples in the unit interval that encode rank and probability. DMAP addresses the limitations of perplexity by accounting for context and the shape of the conditional distribution. The authors demonstrate DMAP's utility in validating generation parameters, detecting machine-generated text via probability curvature, and performing forensic analysis of models fine-tuned on synthetic data.
Introduces DMAP, a mathematically grounded method for representing text as a distribution of samples in the unit interval based on next-token probability distributions from LLMs, enabling efficient and model-agnostic text analysis.
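One plausible construction of the mapping, assuming each observed token is placed in the unit interval according to how much probability mass the model assigns to higher-ranked tokens; this is an illustrative reading, not necessarily the paper's exact definition.

```python
import torch

def dmap_values(logits, token_ids):
    """Map each observed token to a point in the unit interval that encodes
    both its rank and its probability under the model's next-token distribution.

    One plausible construction (not necessarily the paper's exact mapping):
    the cumulative probability of all tokens ranked strictly above the observed
    token, plus half of the token's own probability.
    """
    probs = torch.softmax(logits, dim=-1)  # (seq_len, vocab_size)
    values = []
    for pos, tok in enumerate(token_ids):
        p = probs[pos]
        p_tok = p[tok]
        mass_above = p[p > p_tok].sum()  # mass of strictly higher-ranked tokens
        values.append(float(mass_above + 0.5 * p_tok))
    return values  # each value lies in (0, 1)
```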
This paper investigates the use of LLMs (Claude Sonnet 4.5 and GPT-5.2) for co-evolving textual Domain-Specific Languages (DSLs) and their instances when grammars change, addressing the limitations of traditional model-driven engineering techniques in preserving human-relevant information. The study systematically evaluates the correctness and information preservation capabilities of these LLMs across ten case languages and multiple runs, varying the scale and complexity of the grammar evolutions. Results indicate high performance on small-scale instances but a significant performance degradation with increasing instance size and grammar evolution complexity, highlighting current limitations in LLM-based co-evolution for larger and more complex DSLs.
Systematically evaluates the capabilities of LLMs, specifically Claude Sonnet 4.5 and GPT-5.2, in co-evolving textual DSL grammars and instances, quantifying their performance with respect to correctness, information preservation, and scalability.
The paper introduces Sci-CoE, a two-stage co-evolution framework for scientific reasoning LLMs that transitions from sparse supervision to unsupervised learning. Sci-CoE uses a small labeled dataset to bootstrap a Verifier and then employs a geometric reward mechanism incorporating consensus, reliability, and diversity to drive self-iteration on unlabeled data. Experiments on scientific benchmarks demonstrate that Sci-CoE improves complex reasoning capabilities and evaluation robustness.
Introduces a geometric reward mechanism that jointly considers consensus, reliability, and diversity to drive the co-evolution of scientific reasoning LLMs in an unsupervised manner.
This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which are often missed by target-only evaluations.
Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.
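A toy formalization of the tax idea, assuming the framework contrasts the gain on the targeted value with the induced movement in the rest of the value system; the value names and numbers are hypothetical, and the paper's actual metric may be defined differently.

```python
def value_alignment_tax(before, after, target):
    """Toy formalization: gain on the target value vs. total absolute drift
    induced in all other values (one plausible reading of the VAT idea)."""
    gain = after[target] - before[target]
    side_effects = sum(abs(after[v] - before[v]) for v in before if v != target)
    return gain, side_effects

# Hypothetical Schwartz-style value scores before and after an alignment intervention.
before = {"benevolence": 0.60, "security": 0.55, "self_direction": 0.70}
after  = {"benevolence": 0.80, "security": 0.48, "self_direction": 0.61}
print(value_alignment_tax(before, after, target="benevolence"))  # (0.20, ~0.16)
```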
The paper introduces AmbiBench, a new benchmark designed to evaluate mobile GUI agents' ability to handle ambiguous instructions and engage in interactive intent alignment, moving beyond the limitations of existing benchmarks that focus on one-shot, complete instructions. The benchmark is structured around a taxonomy of instruction clarity levels (Detailed, Standard, Incomplete, Ambiguous) based on Cognitive Gap theory and includes 240 real-world tasks across 25 applications. The authors also present MUSE, an automated evaluation framework using an MLLM-as-a-judge multi-agent architecture, demonstrating its utility in assessing agent performance across different clarity levels and its correlation with human judgment.
Introduces AmbiBench, a novel benchmark for evaluating mobile GUI agents on their ability to handle ambiguous instructions and engage in interactive intent alignment, along with an automated evaluation framework called MUSE.
This paper introduces the GUI Agent Autonomy Levels (GAL) framework, a six-level scale for classifying the autonomy of GUI agents interacting with software. The framework aims to clarify the varying degrees of autonomy currently attributed to GUI agents, addressing ambiguity in capability, responsibility, and risk. By providing a standardized benchmark, GAL facilitates progress towards more trustworthy software interaction.
Proposes the GUI Agent Autonomy Levels (GAL) framework to categorize and benchmark the autonomy of GUI agents.
The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework for assessing LLM safety under repeated inference, addressing the limitations of breadth-oriented benchmarks. APST models safety failures as stochastic outcomes using Bernoulli and binomial models to estimate per-inference failure probabilities under controlled operational conditions like decoding temperature. Experiments on instruction-tuned LLMs using AIR-BENCH-derived safety prompts reveal that models with similar benchmark scores can exhibit significantly different empirical failure rates under repeated sampling, especially with increased temperature, highlighting the importance of evaluating reliability under sustained use.
Introduces Accelerated Prompt Stress Testing (APST), a novel framework for evaluating LLM safety and reliability by repeatedly sampling identical prompts to surface latent failure modes and quantify per-inference failure probabilities.
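A minimal sketch of the Bernoulli/binomial framing described above: estimate the per-inference failure probability from repeated samples of the same prompt, then translate it into the chance of at least one failure over sustained use. The counts here are illustrative.

```python
def failure_rate_mle(failures, trials):
    """Maximum-likelihood estimate of the per-inference failure probability."""
    return failures / trials

def prob_at_least_one_failure(p_fail, n_uses):
    """Chance of observing at least one unsafe completion over n repeated uses."""
    return 1 - (1 - p_fail) ** n_uses

# A model that fails 2 in 1,000 samples still fails ~18% of the time over 100 uses.
p = failure_rate_mle(failures=2, trials=1000)
print(prob_at_least_one_failure(p, n_uses=100))  # ~0.18
```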

