Beihang, Fudan, JD.com

Received 10 September 2025; revised 3Jun 4, 2026

arXiv:2606.05874

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

AI Summary

This paper introduces RandomBench, a benchmark for evaluating the stochastic behavior of Multimodal Large Language Models (MLLMs) in logic-neutral scenarios where multiple actions are equally valid. The authors identify a failure mode they term Stochastic Collapse, in which MLLMs deviate sharply from the expected randomness: top-1 probabilities reach 97% against an ideal of 25%. Using three new metrics (RI, BCI, and BII), the experiments show pervasive distributional bias in MLLMs and indicate that these models struggle to maintain uniform randomness across languages and formats.

Key Contribution

MLLMs exhibit alarming Stochastic Collapse, failing to maintain randomness even under explicit random instructions, which could undermine their utility in diverse applications.

Abstract

Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% against the ideal baseline of one quarter and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
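The abstract does not define RI, BCI, or BII, so the sketch below does not attempt to reproduce them. As a rough illustration of the kind of quantity involved, it computes two generic statistics over repeated choices among four equivalent options: the top-1 frequency and the Shannon entropy normalized by its maximum. The function name `choice_stats` and the sample data are assumptions made for illustration, not part of RandomBench.

```python
import math
from collections import Counter


def choice_stats(choices, num_options):
    """Return (top1_frequency, normalized_entropy) for sampled choices.

    Hypothetical helper for illustration only; these are generic
    statistics, not the paper's RI, BCI, or BII metrics.
    """
    if not choices:
        raise ValueError("choices must be non-empty")
    if num_options < 2:
        raise ValueError("num_options must be at least 2")

    counts = Counter(choices)
    total = len(choices)
    probs = [count / total for count in counts.values()]

    # Share of samples that went to the most frequently chosen option.
    top1 = max(probs)

    # Shannon entropy divided by log(num_options), so a uniform
    # distribution scores 1 and a single repeated choice scores 0.
    entropy = -sum(p * math.log(p) for p in probs)
    return top1, entropy / math.log(num_options)


# Made-up samples: one near-deterministic chooser, one uniform chooser.
collapsed = ["A"] * 97 + ["B", "C", "D"]
uniform = ["A", "B", "C", "D"] * 25

collapsed_top1, collapsed_entropy = choice_stats(collapsed, num_options=4)
uniform_top1, uniform_entropy = choice_stats(uniform, num_options=4)
```

With four options, a uniform chooser has a top-1 frequency of one quarter and a normalized entropy of 1, while the collapsed sample concentrates 97% of its mass on one option and its normalized entropy falls close to 0.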

Eval Frameworks & Benchmarks, Multimodal Models, Recommendation & Information Retrieval

