KU · Apr 1, 2026 · arXiv:2604.00944

Auditing the Reliability of Multimodal Generative Search

Erfan Samieyan Sahneh, Luca Maria Aiello

AI Summary

This paper audits the reliability of Google's Gemini 2.5 Pro as a multimodal generative search system, focusing on whether the claims it generates are actually supported by the YouTube videos it cites. The study analyzes 11,943 claim-video pairs across Medical, Economic, and General domains, using LLM judges validated against human annotations to assess claim support. Depending on the judge's strictness, 3.7% to 18.7% of video-grounded claims are unsupported. The main failure modes are unverifiable specificities and overstated claims, and unsupported claims are associated with vocabulary that departs from the source and with low semantic similarity between the claim and the video transcript.
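As a rough illustration of what "vocabulary divergence" between a claim and a transcript could mean, the sketch below computes the fraction of a claim's distinct tokens that also appear in the transcript. This is an assumed proxy, not the feature definition used in the paper; the function names and the example strings are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vocabulary_overlap(claim, transcript):
    """Fraction of distinct claim tokens that also occur in the transcript.

    Assumption: a simple set-overlap proxy for how closely a claim
    sticks to the source vocabulary (1.0 = every claim token appears).
    """
    claim_tokens = set(tokenize(claim))
    if not claim_tokens:
        return 0.0
    transcript_tokens = set(tokenize(transcript))
    return len(claim_tokens & transcript_tokens) / len(claim_tokens)


# Hypothetical example strings, not data from the paper.
transcript = "the doctor recommends drinking water and resting after exercise"
grounded = "the doctor recommends resting after exercise"
specific = "the doctor recommends 500 ml of electrolyte solution"

print(vocabulary_overlap(grounded, transcript))  # all claim tokens are in the transcript
print(vocabulary_overlap(specific, transcript))  # added specifics lower the overlap
```

A claim that injects precise details absent from the transcript scores lower on this proxy, which matches the direction of the association reported above.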

Key Contribution

Between 3.7% and 18.7% of Gemini 2.5 Pro's video-grounded claims are not supported by the cited video, depending on the judge's strictness. The failures are mostly not outright contradictions of the source but precise, unverifiable specifics and overstated claims.

Abstract

Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable specificities and overstated claims, suggesting the system injects precise but ungrounded details from parametric knowledge while citing videos as evidence. Exploratory post-hoc analysis via logistic regression reveals properties associated with these failures: claims departing from source vocabulary ($\beta = -1.6$ to $-3.1$, $p < 0.01$) and claims with low semantic similarity to the video transcript ($\beta = -2.1$ to $-11.6$, $p < 0.01$) are significantly more likely to be unsupported. These findings characterize the current trustworthiness of video-based generative search and highlight the gap between the confidence these systems project and the fidelity of their outputs.
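To make the judge-dependent range concrete, the sketch below shows how per-judge unsupported rates and an agreement figure could be computed from three judges' verdicts. It assumes agreement means the share of claim-video pairs on which all three judges give the same label, which may differ from the definition behind the 87.7% figure; the judge names, labels, and toy verdicts are hypothetical.

```python
def unsupported_rate(labels):
    """Share of claims a single judge marks as anything other than 'supported'."""
    return sum(label != "supported" for label in labels) / len(labels)


def unanimous_agreement(verdicts):
    """Share of claims on which every judge gives the same label.

    Assumption: one possible reading of 'inter-rater agreement';
    pairwise agreement or a chance-corrected statistic would differ.
    """
    rows = list(zip(*verdicts.values()))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)


S, U = "supported", "unsupported"

# Toy verdicts for 8 claim-video pairs (illustrative only, not paper data).
verdicts = {
    "lenient_judge": [S, S, S, S, S, S, S, U],
    "middle_judge":  [S, S, S, S, S, S, U, U],
    "strict_judge":  [S, S, S, S, S, U, U, U],
}

rates = {name: unsupported_rate(labels) for name, labels in verdicts.items()}
print(rates)
print(min(rates.values()), max(rates.values()))  # range across judges
print(unanimous_agreement(verdicts))
```

Reporting the minimum and maximum rate across judges, rather than a single pooled number, keeps the dependence on judge strictness visible.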

Eval Frameworks & Benchmarks · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
