Eval Frameworks & Benchmarks
Evaluation methodology for AI systems, benchmark design, capability measurement, and safety evaluations.
Recent Papers
The paper introduces the Visual Reasoning Benchmark (VRB), a new dataset of 701 visual reasoning questions sourced from primary school exams in Zambia and India, designed to evaluate multimodal large language models (MLLMs). The VRB focuses on minimal-text images to simulate realistic classroom visual reasoning problems, covering tasks like analogy, pattern completion, and spatial matching. Experiments using the VRB reveal that MLLMs exhibit a "jagged frontier" of capabilities, performing well on static tasks like counting but struggling with dynamic spatial operations like folding and rotation.
Introduces the Visual Reasoning Benchmark (VRB), a novel dataset of classroom-authentic visual reasoning problems, to evaluate the spatial reasoning capabilities of MLLMs.
The paper introduces SAGEO Arena, a realistic evaluation environment for Search-Augmented Generative Engine Optimization (SAGEO) that addresses limitations of existing benchmarks by incorporating a full generative search pipeline over a large-scale corpus of web documents with rich structural information. They demonstrate that existing optimization approaches are often impractical and degrade performance in retrieval and reranking stages under realistic conditions. The study highlights the importance of structural information and stage-specific optimization for effective SAGEO.
Introduces SAGEO Arena, a novel benchmark environment enabling realistic, stage-level evaluation of search-augmented generative engine optimization strategies.
This paper introduces TopoFair, a benchmarking framework for fair link prediction that focuses on the impact of diverse topological biases beyond homophily. They formalize a taxonomy of topological bias measures and develop a graph generation method that allows for controlled variation of these biases while maintaining real-world graph characteristics. Through empirical evaluation of link prediction models, including fairness-aware methods, they demonstrate the sensitivity of fairness interventions to these structural biases.
Introduces a novel benchmarking framework, TopoFair, to analyze the interplay between topological biases and fairness in link prediction.
This paper introduces CSEval, a framework for evaluating the clinical semantic alignment between text prompts and generated medical images, addressing the limitations of existing metrics focused on realism and diversity. CSEval uses language models to identify semantic inconsistencies related to anatomical location and pathology, demonstrating a correlation with expert clinical judgment. The framework offers a scalable method for assessing the clinical reliability of generated medical images, crucial for the safe deployment of text-to-image models in healthcare.
Introduces CSEval, a novel language model-based framework, to evaluate the clinical semantic alignment between text prompts and generated medical images.
The paper investigates test-time scaling strategies for web agents in multi-step tasks, finding that uniform scaling saturates quickly and LLM-based arbiters can overrule high-consensus decisions. They demonstrate that uncertainty statistics from the agent's vote distribution correlate with task success, enabling dynamic compute allocation. Based on these findings, they introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are contentious, improving performance and efficiency.
Introduces Confidence-Aware Test-Time Scaling (CATTS), a novel method for dynamically allocating compute to web agents based on vote-derived uncertainty, achieving improved performance and efficiency compared to uniform scaling.
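A minimal sketch of the confidence-aware idea, assuming the vote-then-scale setup described above: sample a few candidate actions, measure how contested the vote is, and spend extra samples only when the margin is small. The `sample_action` callable and the thresholds are hypothetical placeholders, not the paper's interface.

```python
from collections import Counter

def vote_margin(votes):
    """Fraction margin between the top two vote shares (1.0 = unanimous)."""
    counts = Counter(votes).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    return (top - second) / len(votes)

def confidence_aware_decision(sample_action, n_base=4, n_extra=12, margin_threshold=0.5):
    """Sample a small initial vote; spend extra compute only when the vote is contentious.

    `sample_action` is a hypothetical callable that runs the agent once and
    returns its proposed action for the current step.
    """
    votes = [sample_action() for _ in range(n_base)]
    if vote_margin(votes) < margin_threshold:  # contentious: allocate more samples
        votes += [sample_action() for _ in range(n_extra)]
    return Counter(votes).most_common(1)[0][0]
```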
The paper introduces RouterXBench, a comprehensive evaluation framework for LLM routers, addressing limitations of existing benchmarks by considering router ability, scenario alignment, and cross-domain robustness. They propose ProbeDirichlet, a novel router that leverages internal hidden states and learnable Dirichlet distributions for probabilistic training, capturing model uncertainty more effectively than methods relying on output probabilities or external embeddings. Empirical results demonstrate that ProbeDirichlet outperforms existing routers, achieving significant improvements in router ability and high-accuracy scenarios, while exhibiting robust generalization across diverse model families, scales, tasks, and workflows.
Introduces ProbeDirichlet, a router that aggregates cross-layer hidden states via learnable Dirichlet distributions for improved uncertainty estimation and routing decisions.
The authors introduce Text2GQL-Bench, a new benchmark for text-to-graph query language translation, comprising 178,184 question-query pairs across 13 domains and supporting multiple graph query languages. They also present a comprehensive evaluation method that assesses grammatical validity, similarity, semantic alignment, and execution accuracy, moving beyond simple end-to-end metrics. Experiments reveal a significant "dialect gap" in ISO-GQL generation, where even strong LLMs struggle in zero-shot settings but improve substantially with few-shot prompting or fine-tuning.
Introduces a unified benchmark, Text2GQL-Bench, for evaluating text-to-graph query language systems, featuring a multi-GQL dataset and a scalable construction framework.
The paper introduces DeepSight, an open-source toolkit designed to integrate safety evaluation and diagnosis for large language models (LLMs) and multimodal large language models (MLLMs). DeepSight combines DeepSafe, an evaluation toolkit, and DeepScan, a diagnosis toolkit, to provide a more comprehensive safety workflow. By unifying task and data protocols, DeepSight aims to bridge the gap between black-box risk evaluation and white-box mechanistic understanding, facilitating targeted safety alignment.
Introduces DeepSight, the first open-source toolkit to support frontier AI risk evaluation and joint safety evaluation and diagnosis by unifying task and data protocols.
The paper introduces IncompeBench, a new benchmark for Music Information Retrieval (MIR) consisting of 1,574 permissively licensed music snippets, 500 diverse queries, and over 125,000 relevance judgements. This benchmark addresses the lack of high-quality evaluation datasets in MIR, enabling more rigorous and reproducible research. High inter-annotator agreement was achieved through a multi-stage annotation pipeline, ensuring data quality.
Provides IncompeBench, a permissively licensed, fine-grained benchmark dataset to facilitate advancements in music information retrieval.
The paper introduces WavBench, a new benchmark for end-to-end spoken dialogue models that evaluates reasoning, colloquialism, and paralinguistics, addressing limitations of existing text-centric benchmarks. WavBench comprises three subsets: Pro (reasoning), Basic (colloquialism), and Acoustic (paralinguistics), designed to assess complex problem-solving, natural language fluency, and nuanced understanding/generation of acoustic cues. Evaluation of five state-of-the-art models using WavBench reveals critical insights into model performance across these dimensions, highlighting areas for improvement in building more robust spoken dialogue agents.
Introduces WavBench, a novel benchmark dataset and evaluation toolkit designed to comprehensively assess reasoning, colloquialism, and paralinguistic capabilities in end-to-end spoken dialogue models.
This paper investigates the effectiveness of repository-level context files (e.g., AGENTS.md) in improving the performance of coding agents on software development tasks. Through experiments on SWE-bench tasks with LLM-generated context files and a novel dataset of issues from repositories with developer-committed context files, the authors find that context files generally decrease task success rates and increase inference costs. They attribute this to unnecessary constraints imposed by the context files, suggesting that human-written context files should be minimal.
Empirically demonstrates that repository-level context files, both LLM-generated and human-written, can hinder the performance of coding agents on software development tasks.
This paper investigates whether the sentences in existing semantically deviant datasets are genuinely nonsensical by comparing human and LLM judgments, both with and without provided contexts. The study reveals that humans generally perceive these sentences as anomalous rather than nonsensical, suggesting existing datasets may not be as nonsensical as assumed. Furthermore, the research demonstrates LLMs' ability to generate plausible contexts that render anomalous sentences more sensible.
Empirically demonstrates that existing "nonsensical" datasets are largely composed of anomalous sentences interpretable with context, and that LLMs can generate such contexts.
The paper introduces DynaHOI-Gym, a new online closed-loop platform for benchmarking hand motion generation in dynamic hand-object interaction (HOI) scenarios, addressing the limitations of existing benchmarks focused on static objects. To facilitate research, the authors release DynaHOI-10M, a large-scale dataset comprising 10 million frames and 180K hand capture trajectories with diverse target motions. They also present an observe-before-act (ObAct) baseline that leverages spatiotemporal attention, demonstrating improved location success rates in the dynamic HOI setting.
Introduces DynaHOI-Gym and DynaHOI-10M, a novel benchmark and dataset for evaluating hand motion generation in dynamic hand-object interaction scenarios.
The paper investigates the phenomenon of "benchmark illusion," where LLMs with similar benchmark accuracy exhibit significant disagreement on individual data points. Using MMLU-Pro and GPQA benchmarks, the authors quantify the disagreement rates between various LLMs, including top-performing frontier models. They demonstrate that this disagreement can lead to substantial variability in scientific research outcomes when LLMs are used for data annotation and inference, impacting the reproducibility of results.
Demonstrates that seemingly convergent benchmark accuracy among LLMs masks substantial disagreement on individual data points, leading to significant consequences for scientific reproducibility.
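A toy illustration of the underlying point, using made-up predictions rather than anything from the paper: two models can share the same benchmark accuracy while answering a large fraction of individual items differently.

```python
def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Toy illustration: identical benchmark accuracy, yet 50% item-level disagreement.
gold    = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_1 = ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C"]  # 7/10 correct
model_2 = ["A", "B", "C", "D", "A", "C", "D", "D", "A", "D"]  # 7/10 correct
assert accuracy(model_1, gold) == accuracy(model_2, gold) == 0.7
print(disagreement_rate(model_1, model_2))  # 0.5
```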
The paper introduces TIME, a new time series forecasting benchmark designed to address limitations in existing benchmarks related to data composition, integrity, task formulation, and analysis perspectives. TIME comprises 50 fresh datasets and 98 forecasting tasks constructed using a human-in-the-loop pipeline to ensure data integrity and real-world alignment. The benchmark also introduces a pattern-level evaluation perspective based on structural time series features to provide generalizable insights into model capabilities, and the authors evaluate 12 time series foundation models (TSFMs) on TIME.
Introduces TIME, a novel task-centric time series forecasting benchmark with enhanced data integrity, real-world task formulations, and pattern-level evaluation.
This paper investigates GPT-5's ability to learn Idris, a functional programming language, through iterative prompting strategies. The authors found that zero-shot performance on Idris programming exercises was significantly lower than performance on Python and Erlang. By incorporating local compilation errors into the prompts, the authors achieved a substantial performance increase, solving 54 out of 56 problems.
Demonstrates that compiler-guided, error-driven iterative prompting significantly improves GPT-5's performance in a low-resource programming language.
The paper investigates modality arbitration in Audio-LLMs, revealing a strong bias towards text over audio when the two modalities conflict, even when audio quality is superior. Using the ALME benchmark, the authors demonstrate that Gemini 2.0 Flash exhibits significantly higher text dominance in audio-text conflicts compared to text-text conflicts. They propose that this text dominance arises from an asymmetry in arbitration accessibility rather than information content, and provide evidence through interventions like forced transcription and fine-tuning ablations.
Reveals and analyzes a significant text dominance bias in audio-LLMs during modality arbitration, attributing it to differences in representational accessibility rather than information content.
This paper introduces RooflineBench, a benchmarking framework for on-device LLMs based on the Roofline model, using operational intensity (OI) to unify architectural primitives and hardware constraints. They define an inference-potential region and introduce Relative Inference Potential to compare LLM efficiency on the same hardware. Empirical analysis reveals that sequence length significantly influences performance and OI, identifies OI regression with model depth, and demonstrates how structural refinements like M-LA can unlock inference potential.
Introduces RooflineBench, a novel benchmarking framework leveraging Roofline analysis and operational intensity to evaluate and optimize on-device LLM performance across diverse hardware platforms.
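For readers unfamiliar with the underlying Roofline model, a minimal sketch of the standard bound it builds on: attainable throughput is the lesser of the hardware's peak compute and operational intensity times memory bandwidth. The hardware numbers and the GEMV example are hypothetical, and this is not the paper's RooflineBench API.

```python
def operational_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline_attainable_gflops(oi, peak_gflops, peak_gbps):
    """Classic Roofline bound: compute-bound above the ridge point, memory-bound below."""
    return min(peak_gflops, oi * peak_gbps)

# Example: a decode-phase GEMV with low OI on hypothetical mobile hardware.
oi = operational_intensity(flops=2 * 4096 * 4096, bytes_moved=4096 * 4096 * 2)  # ~1 FLOP/byte
print(roofline_attainable_gflops(oi, peak_gflops=2000, peak_gbps=100))  # memory-bound: ~100 GFLOPS
```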
The paper investigates the failure of speech recognition models on transcribing U.S. street names, finding a 44% error rate across 15 models from major vendors and disproportionately larger routing distance errors for non-English primary speakers. It highlights the gap between benchmark performance and real-world reliability, particularly for high-stakes tasks involving named entities. The authors then demonstrate that fine-tuning with a small, synthetically generated dataset of diverse pronunciations improves street name transcription accuracy by nearly 60% for non-English primary speakers.
Demonstrates that speech recognition models exhibit significant transcription errors on street names, particularly impacting non-English speakers, and mitigates this issue through synthetic data augmentation.
This paper introduces Microarchitecture Cliffs, a benchmark generation methodology to identify and attribute microarchitectural mismatches between architectural simulators and RTL implementations for model calibration. The Cliff methodology generates benchmarks that isolate individual microarchitectural features, enabling precise attribution of behavioral differences. Applying this methodology to calibrate XS-GEM5 against XS-RTL, the authors reduced performance error on Cliff benchmarks from 59.2% to 1.4% and improved performance prediction accuracy on SPEC2017 benchmarks.
Introduces a novel benchmark generation methodology, Microarchitecture Cliffs, for isolating and attributing microarchitectural discrepancies between simulators and RTL implementations, significantly improving simulator calibration accuracy.
The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) fine-tuned reasoning models, where standard likelihood-based detection methods are ineffective. They observe that RLVR training leads to a structural convergence in the model's generations for seen prompts, resulting in more rigid and similar outputs compared to unseen prompts. They introduce Min-$k$NN Distance, a black-box detector that leverages this convergence by measuring the average of the $k$ smallest nearest-neighbor edit distances between multiple completions of a given prompt.
Introduces Min-$k$NN Distance, a novel black-box detector, to identify RLVR training data by quantifying the structural convergence of reasoning trajectories induced by RLVR.
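One plausible reading of the detector, sketched under the assumption that "the $k$ smallest nearest-neighbor edit distances" means: compute each completion's edit distance to its nearest neighbor among the other completions, then average the $k$ smallest of those values. The paper's exact definition may differ.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def min_knn_distance(completions, k=3):
    """Average of the k smallest nearest-neighbor edit distances among completions.

    Low values indicate structurally convergent generations, which the paper
    associates with prompts seen during RLVR training.
    """
    nn_dists = []
    for i, c in enumerate(completions):
        others = [edit_distance(c, o) for j, o in enumerate(completions) if j != i]
        nn_dists.append(min(others))
    k = min(k, len(nn_dists))
    return sum(sorted(nn_dists)[:k]) / k
```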
The paper introduces MuRGAt, a new benchmark for evaluating fact-level multimodal attribution in complex reasoning scenarios involving video, audio, and other modalities. MuRGAt requires models to generate answers with explicit reasoning and precise citations that specify modality and temporal segments. The authors also present an automatic evaluation framework that correlates with human judgments, revealing that current MLLMs often hallucinate citations even with correct reasoning, and that increasing reasoning depth can degrade attribution accuracy.
Introduces MuRGAt, a challenging benchmark and automatic evaluation framework for fact-level multimodal attribution that exposes limitations in current MLLMs' ability to ground reasoning in heterogeneous input sources.
The paper investigates how to best pretrain small language models (SLMs) to decide which tokens to predict directly versus delegating to an external source via a special token. They find that loss alone is insufficient for determining optimal delegation, as some high-loss tokens represent acceptable alternative continuations. They introduce LaCy, a pretraining method that uses a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate and resulting in improved FactScore in cascaded generation setups compared to other methods.
Introduces LaCy, a pretraining method that leverages a spaCy grammar parser to augment the loss signal, enabling SLMs to learn when to delegate token prediction to an external source.
This paper investigates the ability of Large Language Models (LLMs) to adapt to language variations across different socioeconomic status (SES) communities by comparing LLM-generated text completions with original text from a novel Reddit and YouTube dataset stratified by SES. The study analyzes 94 sociolinguistic features to assess the degree of stylistic adaptation exhibited by four LLMs. Results indicate that LLMs show limited stylistic modulation with respect to SES, often producing approximations or caricatures, and demonstrate a bias towards emulating upper SES styles, highlighting the risk of amplifying linguistic hierarchies.
Reveals that LLMs exhibit limited stylistic adaptation across socioeconomic strata and tend to favor upper SES linguistic styles, raising concerns about perpetuating linguistic biases.
This paper investigates the impact of underspecified questions on QA performance, finding that a significant portion of questions in standard QA benchmarks are underspecified. They introduce an LLM-based classifier to identify these questions and demonstrate that LLMs perform worse on them. Through a controlled rewriting experiment, they show that rewriting underspecified questions into fully specified variants, while keeping the gold answers fixed, consistently improves QA performance.
Demonstrates that question underspecification is a significant confound in QA evaluation by showing that rewriting underspecified questions improves QA performance.
The paper addresses object hallucination in Multimodal Large Language Models (MLLMs) by improving visual contrastive decoding (VCD) through the creation of an object-aligned auxiliary view. This auxiliary view is constructed by masking the most salient visual evidence based on object-centric attention from self-supervised Vision Transformers, thereby disrupting unsupported tokens during decoding. The proposed method, "Mask What Matters," is prompt-agnostic, model-agnostic, and computationally efficient, leading to improved performance on object hallucination benchmarks.
Introduces a novel object-aligned visual contrastive decoding method that masks salient visual features to mitigate object hallucinations in MLLMs.
The authors introduce ADRD-Bench, a new benchmark dataset for evaluating LLMs on Alzheimer's Disease and Related Dementias (ADRD), comprising a unified QA set from existing medical benchmarks and a novel QA set derived from the Aging Brain Care (ABC) program. They aim to address the lack of ADRD-specific evaluation resources and practical caregiving context in existing benchmarks. Evaluating 33 state-of-the-art LLMs, they found that while some models achieve high accuracy, inconsistencies in reasoning quality and stability remain a significant limitation.
Introduces ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs, incorporating both unified clinical knowledge and practical caregiving questions.
This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. They propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. The authors trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.
Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.
This paper introduces a French-focused benchmark for PDF-to-Markdown conversion using VLMs, addressing the lack of evaluation datasets for non-English documents and the over-penalization of formatting variations in existing benchmarks. The benchmark consists of challenging French documents selected via model-disagreement sampling and is evaluated using unit-test-style checks targeting specific failure modes like text presence and reading order, combined with category-specific normalization. Results across 15 models show that proprietary models exhibit higher robustness on handwriting and forms, while open-weight models are competitive on standard layouts.
Introduces a new French-language PDF-to-Markdown benchmark with targeted unit tests and category-specific normalization to more accurately assess VLM performance in RAG pipelines.
The paper introduces PatientHub, a unified framework to standardize the creation, composition, and deployment of simulated patients for training counselors and scaling therapeutic assessment using Large Language Models. PatientHub addresses the fragmentation in existing patient simulation approaches by providing standardized data formats, prompts, and evaluation metrics, thus improving reproducibility and enabling fair comparisons. The authors demonstrate PatientHub's utility through case studies, showcasing standardized cross-method evaluation, seamless integration of custom evaluation metrics, and the prototyping of new simulator variants.
Introduces PatientHub, a modular framework that unifies patient simulation by standardizing data formats, prompts, and evaluation metrics to facilitate reproducibility and fair comparison of different methods.
The paper introduces V-SHiNE, a browser-based virtual smart home environment designed to facilitate the evaluation of explainable AI (XAI) methods in the context of smart home automation. V-SHiNE enables researchers to configure realistic smart home environments, simulate user behaviors, integrate custom explanation engines, and log user interactions. A user study with 159 participants demonstrates the framework's utility for assessing the impact and quality of different explanation strategies.
Introduces V-SHiNE, a novel browser-based simulation framework, to enable scalable and reproducible evaluation of XAI methods within virtual smart home environments.
This paper investigates the overlap between code review comments generated by human reviewers and those produced by ChatGPT-4, focusing on the types of quality improvements recommended. The authors manually classified 739 human-generated comments from 240 pull requests and compared them to ChatGPT-4's recommendations on the same PRs. Results indicate that while ChatGPT-4 suggests more changes overall, it only identifies 10% of the issues flagged by humans, though 40% of ChatGPT-4's additional suggestions are valuable, highlighting the complementary nature of both approaches.
Quantifies the overlap and differences in quality improvement recommendations between human code reviewers and ChatGPT-4, revealing the strengths and weaknesses of each approach.
The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like τ-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.
This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.
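A minimal sketch of the checklist-reward idea as described above: each turn is scored by the fraction of grounded binary criteria it satisfies, giving a denser, classification-style signal than a single verifiable outcome. The criterion names are hypothetical.

```python
def checklist_reward(criteria_results):
    """Classification-style reward: fraction of per-turn binary criteria satisfied.

    `criteria_results` maps each fine-grained criterion (e.g. "called the right
    tool", "passed the user-provided date") to a grounded True/False judgment.
    """
    if not criteria_results:
        return 0.0
    return sum(criteria_results.values()) / len(criteria_results)

print(checklist_reward({"correct_tool": True, "correct_args": True, "cites_evidence": False}))  # ~0.67
```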
The authors introduce ExtractBench, a new benchmark and evaluation framework for end-to-end PDF-to-JSON structured extraction, designed to address the lack of comprehensive benchmarks and principled evaluation methodologies for complex, nested extraction tasks. ExtractBench comprises 35 PDF documents paired with JSON Schemas and human-annotated gold labels across diverse domains, resulting in 12,867 evaluatable fields with varying schema complexities. Evaluations using ExtractBench reveal that state-of-the-art LLMs struggle with realistic schemas, particularly as schema breadth increases, with some models achieving 0% valid output on a 369-field schema.
Introduces ExtractBench, a novel benchmark and evaluation framework, to address the limitations of existing methods in evaluating complex structured extraction from PDFs using LLMs.
The paper introduces PRIME, a new benchmark designed to evaluate verifiers for process-outcome alignment in mathematical and engineering problem-solving, addressing the limitations of outcome-centric verification methods in Reinforcement Learning with Verifiable Rewards (RLVR). PRIME consists of 2,530 high-difficulty STEM problems and is used to demonstrate that existing verifiers often fail to identify flaws in the derivation process. The authors show that RLVR training using verifiers selected based on PRIME significantly improves performance on challenging math problem sets, and that PRIME's accuracy strongly correlates with RLVR training effectiveness.
Introduces PRIME, a novel benchmark for evaluating the ability of verifiers to align the reasoning process with the final outcome in complex STEM problems.
This paper introduces USE24-XD, a dataset of approximately 100,000 social media posts from X related to the 2024 U.S. presidential election, categorized into five harmful content types using a "wisdom of the crowd" approach with six LLMs. The study validates LLM annotations against human crowdsourcing, finding comparable agreement and high recall for specific categories like Speculation. Analysis of human annotator demographics reveals systematic biases in labeling harmful content, underscoring the subjectivity inherent in such judgments.
Introduces USE24-XD, a large-scale, multi-labeled dataset of election-related social media content annotated by LLMs and validated by human annotators, to facilitate research on harmful online narratives.
This paper investigates whether GPT-4o possesses a genuine Theory of Mind (ToM) by evaluating its ability to model the causal relationship between mental states and behavior. The authors developed a novel evaluation framework based on a cognitively-grounded definition of ToM, probing for coherence, domain-generality, and consistency in the model's understanding of mental state causality. The key finding is that while GPT-4o can approximate human judgments in simple ToM tasks, it fails on logically equivalent tasks and demonstrates low consistency between predicted actions and inferred mental states, suggesting a lack of a robust ToM.
Demonstrates that GPT-4o, despite apparent social proficiency, lacks a coherent, domain-general, and consistent Theory of Mind by revealing inconsistencies in its mental state inferences and action predictions.
This paper introduces a semi-automated pipeline for extracting Subject-Predicate-Object triplets from financial reports using LLMs, addressing the lack of ground truth data by employing ontology-driven proxy metrics like Ontology Conformance and Faithfulness. The authors compare a manually engineered ontology with a document-specific, automatically induced ontology, finding that the latter achieves 100% schema conformance and eliminates ontology drift. They also propose a hybrid verification strategy combining regex matching and LLM-as-a-judge to reduce subject hallucination rates, and identify asymmetries in subject/object hallucinations.
Introduces a semi-automated pipeline for LLM-based triplet extraction from financial reports evaluated using ontology-driven proxy metrics, circumventing the need for annotated ground truth.
This paper analyzes predictive multiplicity, the phenomenon of multiple AI models with similar overall accuracy disagreeing on individual predictions, in the context of the EU AI Act. It argues that high predictive multiplicity violates the Act's requirements for individual-level performance reporting, as it introduces arbitrariness in decisions impacting humans. The paper proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify disagreement between models on individual cases and offers practical guidelines for model providers to evaluate and report predictive multiplicity.
Proposes individual conflict ratios and $\delta$-ambiguity as metrics to quantify predictive multiplicity and facilitate compliance with the EU AI Act's accuracy provisions.
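One plausible formalization of an individual conflict ratio, assuming it counts how many pairs of comparably accurate models disagree on a single case; the paper's precise definitions of the ratio and of $\delta$-ambiguity may differ.

```python
from itertools import combinations

def individual_conflict_ratio(predictions):
    """Fraction of model pairs that disagree on a single case.

    `predictions` holds one prediction per model for the same individual.
    """
    pairs = list(combinations(predictions, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# Five models with similar overall accuracy can still split 3-2 on one person.
print(individual_conflict_ratio(["approve", "approve", "approve", "deny", "deny"]))  # 0.6
```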
The paper introduces CLUES, a framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores to differentiate between input ambiguity requiring clarification and model instability requiring human review. CLUES models Text-to-SQL as a two-stage process of interpretations to answers and computes instability using the Schur complement of a bipartite semantic graph matrix. Experiments on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark demonstrate that CLUES improves failure prediction compared to Kernel Language Entropy and provides diagnostic decomposition for targeted interventions.
Introduces CLUES, a novel framework that decomposes semantic uncertainty in Text-to-SQL into ambiguity and instability scores, enabling targeted interventions for query refinement and model improvement.
The paper introduces Selective Abstraction (SA), a framework for improving the reliability of long-form text generation by allowing LLMs to selectively reduce the specificity of uncertain content instead of abstaining entirely. They formalize SA using selective risk and coverage metrics and propose Atom-wise Selective Abstraction, which decomposes responses into atomic claims and replaces uncertain claims with more general abstractions. Empirical evaluation on FactScore and LongFact-Objects benchmarks demonstrates that Atom-wise SA significantly improves the risk-coverage trade-off compared to claim removal, boosting AURC by up to 27.73% across six open-source models.
Introduces Selective Abstraction, a novel framework enabling LLMs to trade specificity for reliability in long-form generation by selectively abstracting uncertain content.
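A minimal sketch of selective risk and coverage adapted to atomic claims, under the assumption that abstracted claims drop out of the risk computation while claims kept at full specificity count toward both metrics; the paper's formalization may differ in detail.

```python
def selective_risk_and_coverage(claims):
    """Selective-prediction style metrics over atomic claims.

    Each claim is a (kept_specific, is_correct) pair: claims judged uncertain
    are abstracted (kept_specific=False) and drop out of the risk computation.
    """
    kept = [c for c in claims if c[0]]
    coverage = len(kept) / len(claims)
    risk = sum(not correct for _, correct in kept) / len(kept) if kept else 0.0
    return risk, coverage

# Abstracting the two least certain claims trades coverage for lower risk.
claims = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(selective_risk_and_coverage(claims))  # (0.333..., 0.6)
```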
The paper introduces WebTestPilot, an LLM-based agent for end-to-end web testing against natural language specifications that addresses the challenges of implicit oracle inference and probabilistic reasoning. WebTestPilot uses a symbolization layer to represent GUI elements as symbols and translates natural language into step-by-step instructions with inferred pre- and post-conditions over these symbols, effectively capturing data, temporal, and causal dependencies for validation. Experiments on a new benchmark of bug-injected web applications demonstrate that WebTestPilot achieves a 99% task completion rate with 96% precision and 96% recall in bug detection, significantly outperforming existing LLM-based approaches.
Introduces a novel approach to end-to-end web testing by inferring oracles with symbolized GUI elements, enabling the agent to validate implicit requirements and improve bug detection accuracy.
The paper investigates whether neural world models truly learn physical laws or rely on statistical shortcuts, particularly under out-of-distribution shifts. They introduce PhyIP, a non-invasive evaluation protocol that assesses the linear decodability of physical quantities from frozen latent representations, contrasting it with adaptation-based methods. Their results show that when self-supervised learning achieves low error, latent physical structures are linearly accessible and robust to OOD shifts, while adaptation-based evaluations can collapse this structure, suggesting that non-invasive probes are more accurate for evaluating physical world models.
Introduces PhyIP, a non-invasive evaluation protocol, to accurately assess the linear accessibility of physical quantities in frozen latent representations of world models, demonstrating its superiority over adaptation-based methods.
The paper introduces Distribution Map (DMAP), a novel method for representing text using next-token probability distributions from LLMs by mapping text to samples in the unit interval that encode rank and probability. DMAP addresses the limitations of perplexity by accounting for context and the shape of the conditional distribution. The authors demonstrate DMAP's utility in validating generation parameters, detecting machine-generated text via probability curvature, and performing forensic analysis of models fine-tuned on synthetic data.
Introduces DMAP, a mathematically grounded method for representing text as a distribution of samples in the unit interval based on next-token probability distributions from LLMs, enabling efficient and model-agnostic text analysis.
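One plausible construction of the mapping, assuming each observed token is placed in the unit interval according to how much probability mass the model assigns to higher-ranked tokens; this is an illustrative reading, not necessarily the paper's exact definition.

```python
import torch

def dmap_values(logits, token_ids):
    """Map each observed token to a point in the unit interval that encodes
    both its rank and its probability under the model's next-token distribution.

    One plausible construction (not necessarily the paper's exact mapping):
    the cumulative probability of all tokens ranked strictly above the observed
    token, plus half of the token's own probability.
    """
    probs = torch.softmax(logits, dim=-1)  # (seq_len, vocab_size)
    values = []
    for pos, tok in enumerate(token_ids):
        p = probs[pos]
        p_tok = p[tok]
        mass_above = p[p > p_tok].sum()  # mass of strictly higher-ranked tokens
        values.append(float(mass_above + 0.5 * p_tok))
    return values  # each value lies in (0, 1)
```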
This paper investigates the use of LLMs (Claude Sonnet 4.5 and GPT-5.2) for co-evolving textual Domain-Specific Languages (DSLs) and their instances when grammars change, addressing the limitations of traditional model-driven engineering techniques in preserving human-relevant information. The study systematically evaluates the correctness and information preservation capabilities of these LLMs across ten case languages and multiple runs, varying the scale and complexity of the grammar evolutions. Results indicate high performance on small-scale instances but a significant performance degradation with increasing instance size and grammar evolution complexity, highlighting current limitations in LLM-based co-evolution for larger and more complex DSLs.
Systematically evaluates the capabilities of LLMs, specifically Claude Sonnet 4.5 and GPT-5.2, in co-evolving textual DSL grammars and instances, quantifying their performance with respect to correctness, information preservation, and scalability.
The paper introduces Sci-CoE, a two-stage co-evolution framework for scientific reasoning LLMs that transitions from sparse supervision to unsupervised learning. Sci-CoE uses a small labeled dataset to bootstrap a Verifier and then employs a geometric reward mechanism incorporating consensus, reliability, and diversity to drive self-iteration on unlabeled data. Experiments on scientific benchmarks demonstrate that Sci-CoE improves complex reasoning capabilities and evaluation robustness.
Introduces a geometric reward mechanism that jointly considers consensus, reliability, and diversity to drive the co-evolution of scientific reasoning LLMs in an unsupervised manner.
This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which are often missed by target-only evaluations.
Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.
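A toy formalization of the tax idea, assuming the framework contrasts the gain on the targeted value with the induced movement in the rest of the value system; the value names and numbers are hypothetical, and the paper's actual metric may be defined differently.

```python
def value_alignment_tax(before, after, target):
    """Toy formalization: gain on the target value vs. total absolute drift
    induced in all other values (one plausible reading of the VAT idea)."""
    gain = after[target] - before[target]
    side_effects = sum(abs(after[v] - before[v]) for v in before if v != target)
    return gain, side_effects

# Hypothetical Schwartz-style value scores before and after an alignment intervention.
before = {"benevolence": 0.60, "security": 0.55, "self_direction": 0.70}
after  = {"benevolence": 0.80, "security": 0.48, "self_direction": 0.61}
print(value_alignment_tax(before, after, target="benevolence"))  # (0.20, ~0.16)
```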
The paper introduces AmbiBench, a new benchmark designed to evaluate mobile GUI agents' ability to handle ambiguous instructions and engage in interactive intent alignment, moving beyond the limitations of existing benchmarks that focus on one-shot, complete instructions. The benchmark is structured around a taxonomy of instruction clarity levels (Detailed, Standard, Incomplete, Ambiguous) based on Cognitive Gap theory and includes 240 real-world tasks across 25 applications. The authors also present MUSE, an automated evaluation framework using an MLLM-as-a-judge multi-agent architecture, demonstrating its utility in assessing agent performance across different clarity levels and its correlation with human judgment.
Introduces AmbiBench, a novel benchmark for evaluating mobile GUI agents on their ability to handle ambiguous instructions and engage in interactive intent alignment, along with an automated evaluation framework called MUSE.
This paper introduces the GUI Agent Autonomy Levels (GAL) framework, a six-level scale for classifying the autonomy of GUI agents interacting with software. The framework aims to clarify the varying degrees of autonomy currently attributed to GUI agents, addressing ambiguity in capability, responsibility, and risk. By providing a standardized benchmark, GAL facilitates progress towards more trustworthy software interaction.
Proposes the GUI Agent Autonomy Levels (GAL) framework to categorize and benchmark the autonomy of GUI agents.
The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework for assessing LLM safety under repeated inference, addressing the limitations of breadth-oriented benchmarks. APST models safety failures as stochastic outcomes using Bernoulli and binomial models to estimate per-inference failure probabilities under controlled operational conditions like decoding temperature. Experiments on instruction-tuned LLMs using AIR-BENCH-derived safety prompts reveal that models with similar benchmark scores can exhibit significantly different empirical failure rates under repeated sampling, especially with increased temperature, highlighting the importance of evaluating reliability under sustained use.
Introduces Accelerated Prompt Stress Testing (APST), a novel framework for evaluating LLM safety and reliability by repeatedly sampling identical prompts to surface latent failure modes and quantify per-inference failure probabilities.
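A minimal sketch of the Bernoulli/binomial framing described above: estimate the per-inference failure probability from repeated samples of the same prompt, then translate it into the chance of at least one failure over sustained use. The counts here are illustrative.

```python
def failure_rate_mle(failures, trials):
    """Maximum-likelihood estimate of the per-inference failure probability."""
    return failures / trials

def prob_at_least_one_failure(p_fail, n_uses):
    """Chance of observing at least one unsafe completion over n repeated uses."""
    return 1 - (1 - p_fail) ** n_uses

# A model that fails 2 in 1,000 samples still fails ~18% of the time over 100 uses.
p = failure_rate_mle(failures=2, trials=1000)
print(prob_at_least_one_failure(p, n_uses=100))  # ~0.18
```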

