15 papers from Berkeley AI Research (BAIR) on Eval Frameworks & Benchmarks
LVLMs can be made significantly less prone to hallucinations, without any training, by explicitly grounding them in visual evidence and iteratively self-refining their answers based on verified information.
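A minimal sketch of what such a ground-then-refine loop could look like; `lvlm`, `detect_objects`, and `verify_claims` are illustrative stand-ins, not the paper's API:

```python
def grounded_answer(lvlm, image, question, detect_objects, verify_claims,
                    max_rounds=3):
    """Iteratively refine an answer using only verified visual evidence."""
    evidence = detect_objects(image)             # e.g. open-vocabulary detections
    answer = lvlm.generate(image, question)
    for _ in range(max_rounds):
        unsupported = verify_claims(answer, evidence)
        if not unsupported:                      # every claim is grounded; done
            return answer
        # Ask the model to revise, citing only the verified evidence.
        answer = lvlm.generate(
            image,
            f"{question}\nRevise your answer. Unsupported claims: "
            f"{unsupported}. Verified evidence: {evidence}."
        )
    return answer
```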
LLM agents can now learn on the fly and adapt to evolving user needs without disruptive downtime, thanks to a novel meta-learning framework that synthesizes new skills from failure trajectories and optimizes the base policy during inactive periods.
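Roughly, the downtime loop such a framework implies; every name here is an illustrative stand-in, not the paper's API:

```python
def idle_period_update(agent, failure_log, skill_library):
    """Distill logged failures into new skills, then refresh the base policy."""
    for trajectory in failure_log:
        skill = agent.synthesize_skill(trajectory)  # e.g. distill a reusable fix
        if skill is not None:
            skill_library.append(skill)
    agent.optimize(skill_library)    # meta-update while no users are active
    failure_log.clear()
```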
Current ML benchmarks may be gameable even in theory, as they can lack a stable equilibrium where developers are incentivized to improve true model quality rather than just leaderboard scores.
Existing QA benchmarks are too easy for LLMs, so iAgentBench offers a more realistic challenge by requiring agents to synthesize information from multiple sources on high-traffic topics.
Models are substantially better at pairwise self-verification than at independent scoring, unlocking a more efficient and accurate approach to test-time scaling for complex reasoning.
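A hedged sketch of how pairwise verification can drive test-time scaling: instead of scoring each sampled solution independently, run a knockout tournament of pairwise comparisons (`model.compare` is a hypothetical judge call, not the paper's API):

```python
import random

def tournament_select(model, question, candidates):
    """Pick a solution via a knockout tournament of pairwise verifications."""
    pool = list(candidates)
    random.shuffle(pool)
    while len(pool) > 1:
        winners = []
        for a, b in zip(pool[::2], pool[1::2]):
            # Ask which of two candidate solutions is more likely correct.
            winners.append(a if model.compare(question, a, b) == "A" else b)
        if len(pool) % 2:            # odd candidate out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```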
Forget simulated manipulation—ManipulationNet offers a global infrastructure for benchmarking robots in the real world, complete with standardized hardware and software, to finally measure progress toward general manipulation.
Human-written solutions can actually *hurt* model performance on math problems, highlighting a critical gap between strategy usage and executability that Selective Strategy Retrieval (SSR) effectively bridges.
Now you can audit black-box LLM APIs for cheating (model substitution, overbilling) with <1% overhead, using verifiable computation.
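The protocol shape only, not the paper's construction: the provider returns each response with a succinct proof tied to a public commitment to the model, and the client checks it locally (`provider.query` and `verify` are stand-ins for a verifiable-computation backend):

```python
from dataclasses import dataclass

@dataclass
class AuditedResponse:
    output: str
    proof: bytes   # succinct proof tied to a public commitment to the model

def audited_query(provider, verify, verifier_key, model_commitment, prompt):
    """Query a black-box API and check the returned proof before trusting it."""
    resp: AuditedResponse = provider.query(prompt)  # hypothetical client call
    # Local proof verification is cheap relative to inference, which is where
    # a sub-1% overhead claim would come from.
    if not verify(verifier_key, model_commitment, prompt,
                  resp.output, resp.proof):
        raise RuntimeError("proof failed: possible substitution or overbilling")
    return resp.output
```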
A global consensus on AI safety risks and capabilities has emerged from a panel of 100+ independent experts, representing a landmark effort in international collaboration.
Forget temperature scaling: JUCAL calibrates aleatoric and epistemic uncertainty in classifier ensembles, achieving SOTA results with significantly smaller ensembles and lower inference costs.
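JUCAL's internals aren't reproduced here, but the standard ensemble decomposition such a calibrator operates on is: total predictive entropy splits into an aleatoric term (expected member entropy) and an epistemic term (the mutual information between prediction and member):

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """member_probs: (n_members, n_classes) array of softmax outputs."""
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()       # H[E[p]]
    aleatoric = -(member_probs * np.log(member_probs + eps)) \
        .sum(axis=1).mean()                              # E[H[p]]
    epistemic = total - aleatoric                        # mutual information
    return total, aleatoric, epistemic
```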
Detect LLM watermarks 13-15% more efficiently by using e-values for anytime-valid inference, enabling early stopping without sacrificing statistical guarantees.
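A minimal sketch of the anytime-valid recipe, assuming a per-token e-value with expectation at most 1 under the no-watermark null (`evalue_fn` is a stand-in for the paper's construction). By Ville's inequality, the running product exceeds 1/alpha with probability at most alpha under the null, so detection may stop the moment it does:

```python
def detect_watermark(tokens, evalue_fn, alpha=0.01):
    """Return (watermarked?, tokens consumed); stops early on strong evidence."""
    wealth = 1.0
    for i, tok in enumerate(tokens, start=1):
        wealth *= evalue_fn(tok)          # multiply per-token e-values
        if wealth >= 1.0 / alpha:
            return True, i                # anytime-valid rejection of the null
    return False, len(tokens)
```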
Autonomous driving benchmarks get a reality check: ScenicRules exposes failures by combining prioritized, multi-objective rules with formally modeled, stochastic scenarios.
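Illustrative only, not ScenicRules' API: under prioritized, multi-objective rules, a trajectory is judged by the highest-priority rule it violates, so safety rules dominate comfort rules:

```python
RULES = [  # (priority, name, predicate); lower number = higher priority
    (0, "no_collision",   lambda traj: all(s["clearance_m"] > 0.0 for s in traj)),
    (1, "keep_in_lane",   lambda traj: all(abs(s["lane_offset_m"]) < 1.5 for s in traj)),
    (2, "smooth_braking", lambda traj: all(s["decel_mps2"] < 6.0 for s in traj)),
]

def worst_violation(traj):
    """Return the highest-priority violated rule, or None if all pass."""
    violated = [(p, name) for p, name, ok in RULES if not ok(traj)]
    return min(violated) if violated else None
```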
LLMs can't reliably generate the very skills that boost their performance, and smaller models equipped with expert-crafted skills can rival larger, skill-less models.
Despite progress in AI safety, it's still largely unknown how well current safeguards prevent AI harms, and their effectiveness varies wildly.
LLMs evaluating job candidates exhibit significant bias against hedging language, docking candidates' scores by 25.6% on average, even when the content is equivalent.
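A minimal sketch of the paired-evaluation protocol behind a finding like this: score content-equivalent hedged and unhedged texts and measure the relative drop (`judge_score` is a hypothetical LLM-judge call returning a numeric rating):

```python
def hedging_penalty(judge_score, pairs):
    """pairs: (hedged_text, content-equivalent unhedged_text) tuples.
    Returns the mean relative score drop attributable to hedging alone."""
    drops = []
    for hedged, plain in pairs:
        s_hedged, s_plain = judge_score(hedged), judge_score(plain)
        drops.append((s_plain - s_hedged) / s_plain)
    return sum(drops) / len(drops)   # ~0.256 would match the reported 25.6%
```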