MIT CSAIL, Chongqing, HKU, McGovern Institute for Brain Research

May 27, 2026

arXiv:2605.28818

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

AI Summary

This paper investigates whether vision-language models (VLMs) exhibit better alignment with human brain activity and eye movements during natural reading compared to language-only models (LLMs). By comparing matched LLM and VLM pairs in a text-only setting, the study isolates the impact of multimodal pretraining on human alignment. The key finding is that VLMs do not show a global advantage in aligning with human fMRI and eye-tracking data during natural reading, suggesting that language-internal representations are more critical for modeling human text processing, although VLMs may show advantages for sentences with stronger visual semantic content.

Key Contribution

Multimodal pretraining does not guarantee better alignment with human reading patterns; language-internal representations remain the key factor in modeling how humans process text.

Abstract

Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.
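
The contrast between a global and a selective VLM advantage can be illustrated with a small paired comparison. The sketch below is not the authors' analysis: the abstract does not say which statistics were used, so the sign-flip permutation test, the function names, the visual-content ratings, the threshold, and every number in the toy data are assumptions made for illustration only. It takes one alignment score per sentence for each model of a matched pair and tests the VLM-minus-LLM difference twice, once over all sentences and once over the subset rated as visually rich.

```python
import random
from statistics import mean


def paired_permutation_test(llm_scores, vlm_scores, n_perm=5000, seed=0):
    """Sign-flip permutation test on paired differences (VLM minus LLM).

    Returns (mean_difference, two_sided_p_value).
    """
    if len(llm_scores) != len(vlm_scores):
        raise ValueError("score lists must be paired and equal in length")
    if not llm_scores:
        raise ValueError("at least one paired score is required")
    diffs = [v - l for l, v in zip(llm_scores, vlm_scores)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


def global_vs_selective(llm_scores, vlm_scores, visual_ratings, threshold):
    """Test the VLM-minus-LLM gap over all sentences and over the subset
    whose (hypothetical) visual-content rating is at or above `threshold`."""
    idx = [i for i, r in enumerate(visual_ratings) if r >= threshold]
    overall = paired_permutation_test(llm_scores, vlm_scores)
    subset = paired_permutation_test(
        [llm_scores[i] for i in idx], [vlm_scores[i] for i in idx]
    )
    return {"global": overall, "visual_subset": subset, "n_subset": len(idx)}


# Toy, made-up per-sentence alignment scores; NOT results from the paper.
llm = [0.30, 0.28, 0.35, 0.31, 0.29, 0.33]
vlm = [0.29, 0.29, 0.34, 0.36, 0.35, 0.38]
visual = [1, 2, 1, 5, 4, 5]  # hypothetical visual-content ratings

result = global_vs_selective(llm, vlm, visual, threshold=4)
print(result)
```

With real data, a pattern in which the subset difference is reliable while the global difference is not would correspond to the selective advantage described in the abstract. A real analysis would also need far more sentences than this toy set and a correction for multiple comparisons across brain regions.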

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
