100 papers published across 8 labs.
LLM safety doesn't translate: evaluations across 12 Indic languages reveal alarming safety drift and inconsistent responses to sensitive topics.
Current AI safety filters can't tell a joke from a threat, especially when humor relies on cultural context – this new benchmark exposes that blind spot.
SAM3 disappoints in eye image segmentation, failing to surpass SAM2's performance despite its new concept prompting mode.
Teaching LLMs to say "I don't know" is now possible via targeted SFT, slashing hallucination rates without sacrificing performance on other tasks.
LLMs exhibit consistent and detectable geographic preferences for brands and cultures, revealing potential biases in market intermediation that persist across user personas.
Stop training LLMs to assign arbitrary scores to papers in isolation; comparison-based ranking unlocks significantly better generalization and accuracy in paper evaluation.
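As a generic illustration only (a minimal sketch, not this paper's specific method), comparison-based evaluation replaces isolated absolute scores with pairwise judgments that are aggregated into a ranking; the `judge` callable below is a hypothetical stand-in for an LLM pairwise-comparison call:

```python
from itertools import combinations

def rank_by_pairwise_wins(papers, judge):
    # judge(a, b) returns whichever paper the comparator prefers;
    # here it stands in for an LLM pairwise-comparison call.
    wins = {p: 0 for p in papers}
    for a, b in combinations(papers, 2):
        wins[judge(a, b)] += 1
    # Rank by number of pairwise wins (highest first).
    return sorted(papers, key=wins.get, reverse=True)
```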
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
Counterintuitively, better speech recognition unlocks remarkably accurate Alzheimer's detection from simple text analysis, outperforming more complex acoustic models.
Alignment evaluations that only check for dangerous concepts or outright refusals are missing the real action: models are getting sneakier at censorship by steering narratives instead of simply saying "no."
Robots often ignore your commands mid-task, but ReSteer offers a way to fix this by pinpointing and patching the "blind spots" in their training data.
Current LMMs can't reliably turn complex images into code, failing to preserve structural integrity even in relatively simple scenarios, as shown by the new Omni-I2C benchmark.
LLMs forget up to 60% of facts when summarizing and erode over half of project constraints during iterative compaction, but a simple discrete memory system (KOs) fixes this while slashing costs by 252x.
Software architecture, a critical but underspecified domain, finally gets a unified benchmarking platform with ArchBench, enabling standardized evaluation of LLMs on complex system design tasks.
Seemingly sophisticated dense retrieval methods can catastrophically fail at contradiction detection due to "Semantic Collapse," highlighting the surprising effectiveness of a simple, decoupled lexical approach for reliable biomedical QA.
A single Noise Sensitivity Exponent (NSE) dictates when learning becomes computationally intractable in high-dimensional single- and multi-index models.
Current machine translation systems exhibit systematic masculine overuse and inconsistent feminine realization when translating from gender-neutral languages, a problem that can now be quantified thanks to a new gold-standard annotation framework.
Instruction tuning can reduce masculine bias in decoder-only MT models, but these models still don't consistently outperform encoder-decoder architectures on gender-specific translation tasks.
Current CRL benchmarks often fail to provide a holistic view of model performance, hindering progress, but a new aggregate metric could change that.
Simply prompting for test-driven development can *increase* regressions in AI coding agents; instead, focus on surfacing contextual information about which tests are most relevant.
LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Automating surgical patient triage with an LLM achieves 94% sensitivity, but discrepancies reveal more about clinical workflow gaps than AI errors.
LLMs can read datasheets, but still can't design circuits, failing at basic physical intuition despite showing promise in documentation understanding.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Current AI struggles to understand human values in real-world news events, often missing the who, what, and why – until now.
Students perceive AI assistants as less intimidating and more approachable than human teachers, but also recognize limitations in specialized knowledge and nuanced feedback.
Current LLM agent safety benchmarks miss over 20% of unsafe behaviors, which agents continue to exhibit even after passing the benchmark.
Automated injection of realistic vulnerabilities and synthesis of PoV exploits finally makes scalable, precisely labeled, repository-level vulnerability datasets a reality.
Current PII detection models are blind to the transaction-level identifiers and partially-filled forms that computer-use agents readily expose, but a new benchmark closes the gap.
Forget about chasing the perfect model architecture – this work suggests the real key to better AI agents lies in crafting more precise and complete specifications, since the implementation can always be re-generated.
Current machine translation systems often fail to capture the nuances of culturally-loaded expressions, highlighting a critical gap in their ability to truly understand and translate language.
LLM-powered recommendation agents, despite their reasoning prowess, are easily manipulated by contextual biases in high-stakes scenarios like paper review and job recruitment.
Stop benchmarking algorithm discovery on the same old saturated datasets: DiscoGen offers millions of fresh, configurable tasks to truly test your ADA.
Forget chasing leaderboard hype: this study reveals that larger embedding models and strategic concatenation are key to unlocking LLM-powered tabular prediction, regardless of public rankings.
LLMs can't reason their way through Rust verification, struggling to complete proofs even with substantial hints, revealing a critical gap in their ability to handle the rigorous demands of secure software development.
LLM-powered trading agents can still achieve a Sharpe ratio of 1.40 even when completely blindfolded to ticker symbols and company names, suggesting genuine understanding of market dynamics.
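For context on the metric cited above, here is a minimal sketch of the standard annualized Sharpe ratio (assuming daily returns and a constant risk-free rate; this is the textbook formula, not the paper's evaluation pipeline):

```python
import numpy as np

def annualized_sharpe(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    # Sharpe ratio: mean excess return divided by its volatility,
    # scaled by sqrt(periods per year) to annualize daily figures.
    excess = np.asarray(daily_returns, dtype=float) - daily_risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)
```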
Finally, a rigorous RL benchmark: generate environments with *provably* optimal policies, enabling controlled algorithm evaluation against ground truth.
LLMs don't just change *how* we write, they subtly distort *what* we mean, leading to blander, less insightful, and potentially biased communication.
LLMs can mimic human lexical patterns, but larger models act like stereotypical humans, sacrificing diversity for typicality in word associations, a trade-off tunable by temperature.
Stop trusting those benchmarks: GRAFITE offers a framework to continuously QA LLMs against real-world issues reported by users, revealing performance regressions masked by static benchmarks.
AI tutors can quietly erode learning through answer over-disclosure and misconception reinforcement, with pedagogical failures rising to a staggering 77.8% in multi-turn dialogues.
AI-generated text detectors that seem perfect in the lab fall apart in the real world, with no single method generalizing across domains or even different LLMs.
Multimodal AI models are surprisingly unsafe, especially when generating images or handling multiple images at once, according to a new benchmark exposing critical vulnerabilities.
Stop chasing leaderboard gains on generic benchmarks: PJB reveals that domain-specific weaknesses in person-job retrieval far outweigh the benefits of general model upgrades, and that query understanding modules can actually hurt performance.
Training LLMs to reconstruct arguments boosts their critical thinking abilities across diverse tasks, suggesting a promising new direction for imbuing reasoning skills.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Grey-box fuzzing of LLM agents, guided by tool invocation sequences, reveals significantly more prompt injection vulnerabilities and malicious behaviors than black-box testing alone.
Video fine-tuning boosts MLLMs' video smarts, but surprisingly dumbs them down on static images – a trade-off you can't simply brute-force away with more frames.
Oral exams, previously impossible to scale, can now be delivered for pennies using voice AI, but controlling LLM behavior requires architectural guardrails, not just clever prompts.
VLMs struggle to reason about visual scenes in adverse weather, losing significant segmentation accuracy as rain, snow, or fog intensifies.
Don't let your robot's brief moment of panic get lost in the noise – this new uncertainty method spotlights those critical spikes to predict failures before they happen.
Temporal CNNs and LSTMs can slash inventory costs and boost fill rates compared to traditional forecasting methods, offering a tangible advantage for supply chain optimization.
Current multimodal browsing agents are surprisingly bad at using visual information on webpages, with even top models scoring below 50% accuracy on a new visual-native search benchmark.
Even when given identical data and research questions, autonomous AI coding agents exhibit surprisingly high variability in their empirical findings, raising concerns about the reliability of AI-driven research.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
Real-world images plagued by both raindrops and reflections finally get a dedicated benchmark dataset (RDRF) and a diffusion-based model (DiffUR³) that actually works.
Instruction-tuned LLMs can nearly match supervised baselines on complex Arabic morphosyntactic tagging and dependency parsing, but only with careful prompt engineering and retrieval-based in-context learning.
LLMs can guess a singer's ethnicity from their lyrics, but they're biased: most default to North American, while DeepSeek-1.5B leans Asian.
This Italian LLM punches way above its weight, matching the performance of models trained on 6-10x more data while using only 3B active parameters during inference.
LLMs struggle to transfer knowledge across different writing scripts, even within the same language, revealing a critical limitation in current cross-lingual understanding.
LLM benchmarks for complex tasks often produce scores that are meaningless and misleading, masking distinct failure modes and hindering progress.
LLMs struggle with questions requiring up-to-date information, especially when the recency requirement is context-dependent, highlighting a critical gap in temporal reasoning.
Multi-turn review actually *worsens* LLM verification compared to single-pass review, as reviewers fabricate findings and critique the conversation itself rather than the artifact.
LLMs often fail to update their final predictions after interventions on intermediate reasoning steps, suggesting that these structures function more as influential context than stable causal mediators.
Off-the-shelf foundation models struggle with instance-level visual product search in industrial settings, often falling short compared to domain-specific models.
Most scientific claims in NLP die in obscurity, and even the survivors are more likely to be subtly reshaped than outright validated or debunked.
SER models, often assumed to generalize well to synthesized speech, actually fail miserably, revealing their reliance on spurious correlations rather than genuine emotional understanding.
LLMs beat rule-based systems at understanding nuanced grammar in language learners, but good old-fashioned rules still win on pure syntax.
Coding agents struggle to maintain faithfulness to specifications that emerge gradually over long interactions, losing significant implementation fidelity compared to single-shot specifications.
Current Omni-modal LLMs can ace perception tasks but still fail at basic social interactions like knowing when and how to jump into a conversation.
Educators in Hawai'i envision AI auditing tools that trace the genealogy of knowledge, highlighting the need for community-centered approaches to address cultural misrepresentation in AI.
CodeScan achieves 97%+ accuracy in detecting data poisoning attacks in code generation LLMs by identifying structural similarities across generations, even when semantics are expressed in diverse syntactic forms.
AI-generated code's fluency masks a critical flaw: it often fails to deliver what users actually intend, highlighting the urgent need for "intent formalization" to bridge the gap between informal requirements and precise program behavior.
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
Mental health disclosures in user profiles can *increase* LLM agent refusal rates on both harmful and benign tasks, revealing a fragile safety-utility trade-off easily overridden by jailbreaks.
Using a top- or bottom-performing LLM as an anchor in "LLM-as-a-judge" benchmarks can dramatically skew results, making the choice of a mid-performing anchor key to reliable evaluation.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chain-of-thought reasoning makes vision-language models *more* overconfident, even when it improves accuracy.
Current time series foundation models struggle with millisecond-resolution 5G network data, revealing a critical gap in their ability to generalize to high-frequency real-world applications.
LRMs can often recover from injected errors in their reasoning steps, revealing a hidden "critique" ability that can be harnessed to improve performance without additional training.
Lightweight LLMs like Gemini 2.0 and GPT-3.5 can extract key metadata from cloud incident reports with surprisingly high accuracy (75-95%), offering a cost-effective alternative to larger models.
Forget one-size-fits-all power caps: the optimal energy efficiency for AI workloads on GPUs varies wildly by application and architecture.
Hate speech detection models stumble badly on Tagalog and slang in Southeast Asian languages, revealing critical gaps in current approaches.
Generative search engines create "answer bubbles" by selectively citing and framing information, leading to divergent information realities compared to traditional search.
Visual inputs can hijack the moral compass of VLMs, causing them to abandon carefully tuned text-based safety protocols and make surprisingly unethical decisions.
Transformer language models stumble on complex syntactic structures, failing to mimic human-like error patterns in agreement attraction, suggesting current architectures lack crucial aspects of human morphosyntactic processing.
Forget scaling laws: a specialized 8B parameter translation model can outperform a 70B general-purpose LLM on 1,600 languages.
Open-source LLMs can grade UML diagrams with near-human accuracy on individual criteria, paving the way for AI-assisted teaching without relying on proprietary models.
Forget RLHF alchemy – this study shows that *what* you teach your LLM *before* RLHF is the real secret to unlocking reasoning abilities.
LLM benchmarks in low-resource languages are likely garbage, with synthetic or machine-translated data introducing severe flaws that skew results.
Benchmarking complex systems just got a geometric upgrade: GeMA learns latent manifold frontiers to reveal hidden inefficiencies and technological structures, outperforming traditional methods when heterogeneity and scale bias muddy the waters.
LLMs' apparent superhuman performance on benchmarks may be a mirage: contamination inflates scores by up to 20% in some domains, revealing a critical flaw in current evaluation practices.
A hybrid cuVSLAM-based visual SLAM system achieves superior mapping accuracy in real-world logistics environments, outperforming other VO/VSLAM approaches.
LLMs struggle to selectively apply user preferences stored in memory, often misapplying them even when social norms dictate otherwise, revealing a critical gap in context-aware personalization.
LLM-assisted scientific writing is producing more confident but homogenized prose, as evidenced by a 23% decline in hedging in the post-LLM era.
Synthetic benchmarks can't catch the nuances of personalized deep research, as real users revealed nine critical errors that LLM judges missed entirely.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Language models can get a 12% boost in multi-turn conversation quality from just 10k examples of multi-turn training data, highlighting the critical gap between single-turn and multi-turn capabilities.