This paper introduces RandomBench, a benchmark aimed at evaluating the stochastic behavior of Multimodal Large Language Models (MLLMs) in logic-neutral scenarios where multiple actions are equally valid. The authors identify a critical issue termed Stochastic Collapse, where MLLMs exhibit significant deviations from expected randomness, evidenced by top-1 probabilities soaring to 97% instead of the ideal 25%. Through comprehensive experiments and new metrics (RI, BCI, and BII), this work highlights the pervasive distributional bias in MLLMs, revealing that these models struggle to maintain uniform randomness across various languages and formats.
MLLMs exhibit alarming Stochastic Collapse, failing to maintain randomness even under explicit random instructions, which could undermine their utility in diverse applications.
Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where several options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% against the ideal 25% baseline and RI dropping to 0.068 for Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
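To make the idea of measuring collapse concrete, the sketch below computes two generic statistics over repeated choices among four equivalent options: the empirical top-1 probability and the Shannon entropy normalized by its maximum. This is an illustration only. The abstract does not define RI, BCI, or BII, so `normalized_entropy` here is a stand-in and should not be read as the paper's RI; the function name `choice_stats` and the sample data are hypothetical.

```python
import math
from collections import Counter


def choice_stats(choices, options):
    """Return (top-1 probability, normalized entropy) for observed choices.

    Hypothetical helper for illustration; not the paper's RI, BCI, or BII.
    """
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    n = len(choices)
    # Empirical probability of each option, including options never chosen.
    probs = [counts.get(option, 0) / n for option in options]
    top1 = max(probs)
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(k) maps a uniform distribution over k options to 1.0.
    normalized_entropy = entropy / math.log(len(options))
    return top1, normalized_entropy


options = ["A", "B", "C", "D"]

# Assumed sample data: an ideal uniform chooser versus a collapsed one
# that picks the same option in 97 of 100 trials.
uniform = ["A", "B", "C", "D"] * 25
collapsed = ["A"] * 97 + ["B", "C", "D"]

for name, sample in [("uniform", uniform), ("collapsed", collapsed)]:
    top1, h = choice_stats(sample, options)
    print(f"{name}: top-1 = {top1:.2f}, normalized entropy = {h:.3f}")
```

With four options, a perfectly uniform chooser has a top-1 probability of 25% and a normalized entropy of 1, while a chooser that concentrates 97% of its picks on one option scores far lower on the entropy measure.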