13 papers from Google Research on Eval Frameworks & Benchmarks
Current remote sensing change captioning datasets lack fine-grained, localized semantic reasoning; RSRCC fills this gap with 126k change-specific questions.
Stop penalizing your ANN search algorithms for failing to retrieve irrelevant neighbors – Semantic Recall offers a more nuanced and effective way to measure retrieval quality.
Multilingual LLMs exhibit a surprising "American bias," even when prompted in other languages, and instruction tuning makes it worse.
FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.
Debloating tools, intended to shrink code and improve security, can actually *add* code or remove essential functionality, with dynamic methods being overly aggressive and static methods overly conservative.
Forget KL divergence – this work shows you *can* reliably evaluate generative models with finite samples, but only if you use the right metric (IPMs with bounded test classes).
Safety fine-tuning might inadvertently be stripping LLMs of their ability to understand non-human minds and entertain spiritual beliefs, even while preserving Theory of Mind.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
Finally, a framework to quantify AI's cultural intelligence, moving beyond ad-hoc cultural benchmarks to a systematic, extensible, and theoretically grounded approach.
Gemini 3 Deep Think can now autonomously solve a majority of problems in a challenging math competition, signaling a leap in AI's mathematical reasoning capabilities.
LLMs still struggle with infrequently occurring knowledge, and this paper provides a structured framework for understanding why, how to fix it, and what the implications are for responsible AI.