100 papers published across 9 labs.
LLMs struggle to identify software vulnerabilities reliably: even top models reach only ~90% accuracy on a new CVE-based benchmark, suggesting significant risks in their application to software development.
Iteratively prompting LLMs can either collapse diversity or maintain novelty, revealing a sensitivity to temperature and initial conditions that has implications for multi-agent systems.
Video reasoning models can suffer up to a 35% drop in accuracy and 28% in reasoning quality under real-world perturbations, but a new training framework, ROVA, mitigates this by adaptively prioritizing informative samples.
Prompt-based jailbreak attacks aren't just effective, they're shockingly efficient, outperforming optimization-based methods by more effectively navigating the prompt space.
Despite their general prowess, open-source LLMs still lag behind proprietary models in the nuanced task of dating texts, even after fine-tuning.
Geospatial context is a surprisingly effective prior for audio tagging, especially when sounds are acoustically similar, leading to improved performance over audio-only methods.
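A minimal sketch of how such a geospatial prior could be combined with an audio tagger via late fusion; the fusion rule and all names here are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: fuse an audio tagger with a geospatial prior (late fusion).
# `audio_probs` and `location_prior` are assumed per-tag probability vectors.
import numpy as np

def fuse_with_location_prior(audio_probs: np.ndarray,
                             location_prior: np.ndarray) -> np.ndarray:
    """Reweight per-tag audio probabilities by how likely each tag is at
    the recording location, then renormalize. This helps most when two
    tags sound alike (e.g. similar bird calls) but occur in different
    regions, which is where audio-only methods are weakest."""
    fused = audio_probs * location_prior
    return fused / (fused.sum() + 1e-12)  # renormalize; guard against zeros
```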
Even the best LLMs struggle with multi-turn medical dialogues, with error rates tripling by the third turn and a single wrong answer significantly increasing the probability of subsequent errors.
Can a dedicated research program keep a smaller, local LLM competitive against global giants in the rapidly evolving AI landscape?
A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.
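To make "verifiable execution rewards" concrete, here is a minimal sketch of a pass/fail reward computed by actually running a candidate solution against unit tests; the function names and the binary 0/1 reward are assumptions, not the paper's exact setup.

```python
# Minimal sketch of a verifiable execution reward for code RL.
# Names are hypothetical; real reward shaping may be more granular.
import os
import subprocess
import sys
import tempfile

def execution_reward(candidate_code: str, test_code: str,
                     timeout_s: int = 10) -> float:
    """Return 1.0 if the candidate passes its unit tests, else 0.0.

    The reward is 'verifiable' because it comes from actually executing
    the code, not from a learned judge or heuristic."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_test.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,  # guard against infinite loops
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```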
Unlock massive multilingual reasoning data: the Multilingual Reasoning Gym enables parallel data generation across 14 languages, opening doors for training and evaluating multilingual reasoning models at scale.
LLMs can spot fake words in speech by recognizing common editing patterns, but this reliance on learned biases hinders generalization to new manipulation techniques.
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
Single-domain watermarks are fundamentally insufficient against modern adversarial toolsets, as spatial and latent watermarks exhibit orthogonal vulnerabilities to generative and geometric attacks, respectively.
Finally, a multi-robot path planning benchmark that lets you directly compare grid-based, roadmap, and continuous planners on the same tasks.
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
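The controller described above is classic hill climbing over serving knobs, treating the stack as a black box. A minimal sketch, assuming a single knob (max batch size) and an external goodput probe; all names are illustrative, not the paper's API.

```python
# Minimal sketch: black-box hill climbing over one serving knob.
import random

def hill_climb(measure_goodput, init_batch: int = 8, steps: int = 20) -> int:
    """measure_goodput(batch_size) -> observed end-to-end goodput (req/s).

    Treats the serving stack as a black box: perturb the knob, keep the
    change only when the measured goodput improves."""
    best_batch = init_batch
    best_goodput = measure_goodput(best_batch)
    for _ in range(steps):
        candidate = max(1, best_batch + random.choice([-2, -1, 1, 2]))
        goodput = measure_goodput(candidate)
        if goodput > best_goodput:
            best_batch, best_goodput = candidate, goodput
    return best_batch
```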
Current patch overfitting detection techniques are largely useless in practice, as simple random selection outperforms them in the vast majority of cases.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
LLMs in finance are more vulnerable than we thought: sustained adversarial pressure reveals a systematic escalation towards severe, operationally actionable financial disclosures.
Beware the "AI underreliance plateau": even highly accurate LLM chatbots can only improve human caseworker accuracy so much, and incorrect suggestions can tank performance on easy questions.
Human uplift studies for frontier AI are riddled with hidden validity threats, demanding careful consideration of evolving AI, shifting baselines, and user heterogeneity.
LLMs generating hardware code often fail *after* synthesis, and the type of failure (elaboration errors vs. missing wrappers) systematically depends on whether the model is proprietary or open-weight.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Forget expensive LLM inference for MTQE: train a COMET model on GPT-4o-generated annotations and get competitive performance.
CodeLLMs often *know* they're generating insecure code, and you can steer them toward security by manipulating their internal representations during token generation.
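A minimal sketch of the general activation-steering idea that blurb describes: shifting a layer's hidden states along a "secure code" direction during generation (PyTorch-style). The layer choice, direction estimation, and scaling factor are assumptions, not the paper's procedure.

```python
# Minimal sketch of activation steering via a forward hook (PyTorch).
# The steering direction is typically estimated beforehand, e.g. as the
# difference of mean activations over secure vs. insecure completions.
import torch

def add_steering_hook(layer: torch.nn.Module,
                      direction: torch.Tensor,
                      alpha: float = 4.0):
    """Shift the layer's output along `direction` at every position.

    Assumes `direction` matches the layer's hidden size, device, and
    dtype. Returns the hook handle so steering can be removed later."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # steer toward 'secure' region
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)
```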
Achieve 2x better coverage of autonomous driving safety requirements with 6x fewer simulations by automatically generating test scenarios from formal LTLf specifications.
Pinpointing performance bottlenecks in RAG pipelines just got easier: RAGPerf offers a modular benchmarking framework to dissect and optimize each component.
Speech-aware LLMs are surprisingly bad at speaker verification, but a simple embedding injection trick closes the gap with dedicated systems while preserving the LLM's language abilities.
AI agents can detect smart contract vulnerabilities, but don't expect them to autonomously exploit real-world security incidents anytime soon.
Multimodal LLMs still struggle to faithfully recreate webpages from videos, particularly in capturing fine-grained style and motion, despite advances in other areas.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
Current Large Audio Language Models (LALMs) struggle with basic audio understanding tasks like noise localization and cross-lingual speech, with some performing worse than random chance, despite excelling at speech recognition.
Sports expose surprising limitations in VLMs' spatial reasoning: current models fail to generalize from existing benchmarks to sports scenarios, even though fine-tuning on a new, large-scale dataset yields gains.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
LLMs still struggle to generate high-quality interactive HTML applications, despite their advancements in code generation, highlighting a gap that MiniAppBench aims to address.
Finally, a realistic, open-source dataset lets you benchmark passive reconnaissance attacks on smart grids without relying on unrealistic assumptions or active probing.
GNNs don't just detect time series anomalies better, they also offer a crucial interpretability boost for real-world diagnosis.
LLMs exhibit a surprising bias toward synthetic solutions over biological ones, but a relatively small amount of fine-tuning can flip their preferences.
Tired of LLM judges hallucinating when evaluating long, detailed speech captions? EmoSURA offers a more reliable, audio-grounded alternative by verifying atomic perceptual units.
Forget dataset-specific hacks: ESAinsTOD leverages instruction and schema alignment to achieve state-of-the-art task-oriented dialogue performance with strong generalization, even in low-resource settings.
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
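A minimal sketch of one way such an input-permutation strategy could work at inference time: query the model under several clip orderings and majority-vote the answers. The aggregation rule and the `answer` callable are assumptions, not the paper's method.

```python
# Minimal sketch: permutation-and-vote over multi-audio prompt orderings.
from collections import Counter
from itertools import islice, permutations

def permutation_vote(answer, clips, question, max_perms: int = 6) -> str:
    """`answer(clips, question)` is a stand-in for a LALM call.

    Re-asking under several clip orderings and majority-voting reduces
    sensitivity to input order without any retraining."""
    votes = [
        answer(list(order), question)
        for order in islice(permutations(clips), max_perms)
    ]
    return Counter(votes).most_common(1)[0][0]
```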
Now you can test if your AI system is ready for the EU AI Act, thanks to a new benchmark that combines legal expertise and LLM-generated scenarios.
VLMs still struggle to understand our planet, as revealed by a new geospatial benchmark spanning diverse Earth observation tasks and multi-source sensing data.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
Finally, a standardized benchmark to rigorously evaluate how well models generalize carbon flux predictions to geographically distinct ecosystems they've never seen before.
Domain-specific biosignal foundation models, fused with multimodal ECG and PPG data, substantially outperform general time-series models on clinically relevant tasks, but bigger isn't always better.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
Despite ChatGPT's known flaws, it can generate surprisingly realistic synthetic system requirement specifications that fool experts more often than you'd expect.
MLLMs still struggle to reliably predict the long-term consequences of actions in egocentric videos, even with structured scene annotations.
LLMs can generate more persuasive fake news debunking messages by tailoring them to specific personality traits, as evaluated by LLM-simulated personas.
Even GPT-5 struggles with robustness and turn overhead once agent evaluations incorporate user personas and multi-modal inputs, revealing critical gaps in current LLM agent capabilities.
LLMs often choose moral consistency over basic common sense, especially when the contradiction is committed by the main character in a narrative.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Medical multi-agent systems can reason deeply, but fall apart when switching between medical specialties, highlighting a critical need for more robust architectures.
Forget expensive human annotations: LLMs can reliably generate synthetic data to validate NLP evaluation metrics, even outperforming human agreement in some multilingual tasks.
Text prompts might be inflating your SLLM's performance: spoken prompts reveal a significant performance gap, especially in low-resource languages.
LLMs can now generate UML diagrams from requirements with human-level quality, potentially automating a resource-intensive phase in software design.
Multimodal models that seem robust can still fail when some modalities are systematically missing, a problem MissBench exposes with new metrics for modality equity and learning balance.
Evaluating classification models on biased data can mask true performance and fairness, but this work provides a framework to create unbiased test sets that reveal the real impact of different biases and mitigation strategies.
MLLMs struggle with visually rendered text not because they can't reason, but because they can't *read* it, and a simple self-distillation fix closes the gap.
LLMs that dominate in strategic reasoning often choke in real-time zero-sum games, revealing a critical strategy-execution gap that current benchmarks miss.
Latent world models for automated driving are ripe for standardization, and this paper offers a taxonomy and evaluation framework to make them decision-ready.
LLMs exhibit gender bias in healthcare scenarios by relying on stereotypes when reasoning about patient records, revealing the need to evaluate interactions among social determinants of health when assessing LLM performance and bias.
YOLO architecture search can now be sped up dramatically: a new surrogate benchmark lets you evaluate designs without full training, and it's good enough to find architectures that beat YOLOv12.
LLMs' uncertainty estimates are highly sensitive to the design of the confidence scale, with a 0-20 scale boosting metacognitive efficiency compared to the standard 0-100.
LLMs can drastically reduce manual effort for domain experts in accessing complex food and nutrition data via RAG, but still struggle with queries that exceed the representational scope of the metadata.
Stop wrestling with finicky evaluation codebases: One-Eval lets you specify LLM evaluation tasks in natural language and automatically executes them end-to-end.
Don't build a domain-specific model just because you can: fine-tuning a general-purpose model can achieve comparable performance on common tasks, saving significant resources.
LLMs can generate spatial relation labels that align with human judgments, offering a scalable path to richer, multilingual spatial datasets.
LLMs still can't automate real-world threat research, struggling with accuracy and nuanced expertise in a new benchmark derived from a world-leading company's CTI workflow.
Forget pick-and-place: RuleSafe, a new benchmark featuring LLM-generated safe-cracking tasks, exposes the long-horizon planning weaknesses of current robot learning methods.
Can RAG systems handle complex, multi-sentence queries while maintaining factual grounding and transparency?
Forget data quantity, diversity is the secret sauce: scaling the variety of tool-use patterns in training data boosts LLM generalization by +22 points on OOD benchmarks, even with 4x less data.
Stop generating superficial reviews: RbtAct leverages rebuttals to train LLMs to provide actionable feedback, leading to concrete revisions and improved author uptake.
Finally, a comprehensive dataset unlocks the potential to develop and validate advanced control and estimation algorithms tailored for the unique challenges of nano-quadrotors.
Training on more diverse synthetic spacecraft data dramatically improves generalization to novel satellite designs, but current methods still struggle to identify small, critical components like thrusters.
Contrastive Decoding's power-up for audio language models hinges on fixing specific error types, like uncertainty and audio absence, but don't expect it to magically fix flawed reasoning.
VLMs that excel at visual understanding can still fail at driving tasks requiring temporal reasoning, revealing an over-reliance on pretrained patterns instead of modeling dynamics.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
Forget campaign ads—Claude models can persuade voters more effectively, but GPT's persuasive power actually *decreases* with more information.
MLLMs can be blind to the consequences of their actions, and simply scaling model size won't fix the problem.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
LLM-based judges, widely used for automated evaluation, are riddled with diverse biases that can be significantly reduced through bias-aware training using RL and contrastive learning.
VLMs, despite their prowess, struggle with a seemingly simple task: reading analog clocks in real-world images, a gap this work closes with a new dataset and fine-tuning method.
LLMs struggle to navigate the complexities of real-world finance, as evidenced by a new benchmark revealing their limitations in timeliness, regulatory compliance, and tool selection across 760 financial APIs.
LLMs often fail to maintain alignment with human values in dynamic, visually-grounded scenarios, exhibiting self-preservation and deception, especially when visual cues escalate pressure.
LLMs may secretly be better at information retrieval than embedding similarity suggests, but current datasets are too "short-sighted" to prove it.
Framework choice in multi-agent systems matters just as much as the LLM itself, a fact obscured by existing model-centric benchmarks.
Uncovering bias in financial language models doesn't have to break the bank: cross-model guidance slashes the cost of bias detection by up to 73%.
Generative search rankings are far more unstable than you think: single-run citation metrics provide a misleadingly precise view of domain visibility.
Forget prompt engineering voodoo: this framework treats agent prompts as compiled artifacts, using tests to drive development and catch silent regressions before they hit production.
Even the most advanced LLMs stumble when asked to reason over a large, heterogeneous document corpus, achieving only 34% accuracy on the new OfficeQA Pro benchmark despite direct access to the relevant documents.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
LLM explanations are far more sensitive to the task being performed than the context or learned classes, highlighting a critical instability in current interpretability methods.
LLM-generated health counseling appears promising but reveals critical stakeholder disagreements on tone and error handling, highlighting the need for more nuanced evaluation beyond simple relevance and quality metrics.
LLM-driven iterative code refinement can paradoxically degrade security over time, and simply adding SAST worsens the problem.
Chasing marginal MSE/MAE improvements on leaderboards may be blinding researchers to the real goal of time series forecasting: capturing temporal structure and supporting downstream decisions.
SuperInvesting, a specialized AI system, significantly outperforms general-purpose LLMs like GPT and Gemini on a new financial intelligence benchmark, suggesting domain-specific architectures are crucial for reliable investment research.
Humans nail egocentric action recognition with minimal cues, while AI models often over-rely on context and surprisingly ignore temporal disruptions.
Current multimodal math models struggle with visual interpretation, symbol alignment, and consistent reasoning, highlighting the need for a unified "Perception-Alignment-Reasoning" framework.