This paper audits the reliability of Google's Gemini 2.5 Pro as a multimodal generative search system, focusing on its ability to accurately ground claims in cited YouTube videos. The study analyzes 11,943 claim-video pairs across diverse domains, using LLM judges and human validation to assess claim support. Results show that 3.7% to 18.7% of video-grounded claims are unsupported, with failure modes including unverifiable specificities and overstated claims, linked to vocabulary divergence and low semantic similarity between claims and video transcripts.
Gemini 2.5 Pro hallucinates details into video-grounded claims 4-19% of the time, not by outright contradicting the source, but by injecting precise, unsupported specifics.
Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable specificities and overstated claims, suggesting the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs.
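To make the reported range concrete, the following minimal sketch shows how a per-judge unsupported rate and an inter-judge agreement score could be computed from binary verdicts. The judge names and verdicts are hypothetical toy data, and mean pairwise agreement is an assumed definition; the abstract does not state how the 87.7% figure was calculated.

```python
from itertools import combinations

# Hypothetical verdicts: True = judge says the cited video supports the claim.
# Toy data for illustration only, not taken from the paper.
verdicts = {
    "judge_a": [True, True, False, True, True, False, True, True],
    "judge_b": [True, True, False, True, True, True, True, True],
    "judge_c": [True, False, False, True, True, False, True, False],
}

def unsupported_rate(labels):
    """Fraction of claim-video pairs a judge marks as not supported."""
    return sum(not v for v in labels) / len(labels)

def pairwise_agreement(a, b):
    """Fraction of pairs on which two judges give the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# One unsupported rate per judge; the spread reflects judge strictness.
rates = {name: unsupported_rate(labels) for name, labels in verdicts.items()}

# Assumed agreement metric: mean over all judge pairs.
judge_pairs = list(combinations(verdicts, 2))
mean_agreement = sum(
    pairwise_agreement(verdicts[p], verdicts[q]) for p, q in judge_pairs
) / len(judge_pairs)

print("unsupported rate per judge:", rates)
print("range:", min(rates.values()), "to", max(rates.values()))
print("mean pairwise agreement:", mean_agreement)
```

Reporting the minimum and maximum across judges, rather than a single pooled figure, mirrors how the abstract presents its result as a range that depends on judge strictness.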