19 papers from Google Research on Eval Frameworks & Benchmarks
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Safety fine-tuning might inadvertently be stripping LLMs of their ability to understand non-human minds and entertain spiritual beliefs, even while preserving Theory of Mind.
ChatGPT's geographic reasoning can be surprisingly brittle, with minor syntactic changes causing significant output variations and task composition revealing unexpected distributional shifts.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
Despite dedicated efforts from multiple teams, existing speech systems still fall well short of deployment readiness for understanding real-world medical conversations in Indian languages, underscoring the need for further research.
Finally, a framework to quantify AI's cultural intelligence, moving beyond ad-hoc cultural benchmarks to a systematic, extensible, and theoretically grounded approach.
LLM judges inflate math proof scores by up to 0.36 points, revealing a significant alignment gap with human experts and a reasoning breakdown in discrete domains.
Gemini 3 Deep Think can now autonomously solve a majority of problems in a challenging math competition, signaling a leap in AI's mathematical reasoning capabilities.
Forget painstakingly curating evaluation datasets: this framework generates high-quality, multi-hop multiple-choice questions from knowledge graphs with tunable difficulty, all while slashing costs.
LLMs still struggle with infrequently occurring knowledge, and this paper provides a structured framework for understanding why, how it can be addressed, and what the implications are for responsible AI.
Despite recent advances, multimodal models still struggle to understand spatial relationships from an egocentric perspective, as shown by a 37.66% performance gap on the new SAW-Bench benchmark.
LLMs like GPT-5 and Gemini-3 already "know" almost everything (95-98% factual encoding), but struggle to recall it, suggesting that future gains in factuality depend more on better memory retrieval than on simply scaling up.
LLMs can often achieve the same accuracy with significantly shorter self-explanations, suggesting that current chain-of-thought reasoning is unnecessarily verbose.
GPT-5's scientific reasoning skills plummet by nearly 50% when tackling multi-step workflows, revealing a critical gap in current LLM agents' ability to orchestrate complex tool use.
Forget "smart plagiarism" – multi-stage LLM workflows like recursive decomposition and long-context pipelines can actually generate novel research plans, outperforming simpler reflection-based methods.
Clinicians using a new medical literature mining LLM, LEADS, achieved 0.81 recall vs. 0.78 without it, while saving 20.8% of their time.