Mar 30, 2026 · arXiv:2603.28387

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

AI Summary

This paper evaluates 12 open-weight vision-language models (VLMs) on binary clinical neuroimaging classification tasks in which the structural MRI data carries no reliable individual-level diagnostic signal. Smaller VLMs nonetheless show gains of up to 58% F1 when neuroimaging context is introduced, and merely mentioning MRI availability in the prompt accounts for 70-80% of that shift, whether or not imaging data is actually present. The authors call this the "scaffold effect" and conclude that surface-level evaluations are inadequate indicators of multimodal reasoning.

Key Contribution

VLMs can appear to gain up to 58% F1 on clinical tasks when the prompt simply *mentions* MRI data, even though that data is uninformative, revealing a "scaffold effect" that inflates performance metrics.

Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely *mentioning* MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the *scaffold effect*. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Eval Frameworks & Benchmarks · Multimodal Models · Open-Source Models & Weights

