19 papers from Google Research on Eval Frameworks & Benchmarks
MLLMs are riddled with shared vulnerabilities across modalities, meaning a single weakness can be exploited to jailbreak safety filters, hijack instructions, or even poison training data.
Safety fine-tuning might inadvertently be stripping LLMs of their ability to understand non-human minds and entertain spiritual beliefs, even while preserving Theory of Mind.
ChatGPT's geographic reasoning can be surprisingly brittle, with minor syntactic changes causing significant output variations and task composition revealing unexpected distributional shifts.
MLLMs are surprisingly prone to hallucinating subtle details, especially when asked about the absence of specific attributes or relationships within an image.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
LLM-powered diagnostic AI is ready for prime time: a real-world clinical trial shows it's safe, patients love it, and doctors find it useful.
Despite dedicated efforts from multiple teams, existing speech systems still fall well short of deployment readiness for understanding real-world medical conversations in Indian languages, underscoring the need for further research.
Finally, a framework to quantify AI's cultural intelligence, moving beyond ad-hoc cultural benchmarks to a systematic, extensible, and theoretically grounded approach.
LLM judges inflate math proof scores by up to 0.36 points, revealing a significant alignment gap with human experts and a reasoning breakdown in discrete domains.
Gemini 3 Deep Think can now autonomously solve a majority of problems in a challenging math competition, signaling a leap in AI's mathematical reasoning capabilities.
Forget painstakingly curating evaluation datasets: this framework generates high-quality, multi-hop multiple-choice questions from knowledge graphs with tunable difficulty, all while slashing costs.
LLMs still struggle with infrequently occurring knowledge, and this paper provides a structured framework for understanding why, how it can be addressed, and what the implications are for responsible AI.
Despite recent advances, multimodal models still struggle to understand spatial relationships from an egocentric perspective, as shown by a 37.66% performance gap on the new SAW-Bench benchmark.
LLMs like GPT-5 and Gemini-3 already "know" almost everything (95-98% factual encoding), but struggle to recall it, suggesting that future gains in factuality depend more on better memory retrieval than on simply scaling up.
LLMs can often achieve the same accuracy with significantly shorter self-explanations, suggesting that current chain-of-thought reasoning is unnecessarily verbose.
GPT-5's scientific reasoning skills plummet by nearly 50% when tackling multi-step workflows, revealing a critical gap in current LLM agents' ability to orchestrate complex tool use.
Forget "smart plagiarism" – multi-stage LLM workflows like recursive decomposition and long-context pipelines can actually generate novel research plans, outperforming simpler reflection-based methods.
Clinicians using a new medical literature mining LLM, LEADS, achieved 0.81 recall vs. 0.78 without it, while saving 20.8% of their time.