This study investigates the disparity in performance of multimodal LLMs when verifying scientific claims using tables versus charts, revealing that while models successfully encode chart information, they fail to utilize it effectively during predictions. Through layer-wise linear probing and attention analysis across three open-weight VLMs, the authors demonstrate that the gap in performance stems from a routing issue rather than a failure in encoding. The findings highlight two distinct architectural forms of this disconnect, suggesting a need for improved mechanisms to leverage visual data in model predictions.
Multimodal LLMs encode chart information but fail to route it effectively for predictions, revealing a critical gap in scientific claim verification.
Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises a question: do models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs, using table and chart evidence that represents the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.
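To make the probing idea concrete, the following is a minimal sketch of layer-wise linear probing on synthetic data. It is not the authors' code. The abstract says only that linear probes are used, so the closed-form ridge probe, the function names (`fit_ridge_probe`, `layerwise_probe`, `make_synthetic`), and the synthetic hidden states are all assumptions made for illustration. The sketch contrasts a position where the label is linearly decodable (standing in for the evidence tokens) with one where it is not (standing in for the prediction position).

```python
import numpy as np


def fit_ridge_probe(X, y, lam=1e-2):
    """Fit a linear probe with a bias term by closed-form ridge regression."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def probe_accuracy(X_train, y_train, X_test, y_test, lam=1e-2):
    """Train on {0, 1} labels mapped to {-1, +1}; classify by the sign of the output."""
    w = fit_ridge_probe(X_train, 2.0 * y_train - 1.0, lam)
    Xb = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    pred = (Xb @ w > 0).astype(int)
    return float((pred == y_test).mean())


def layerwise_probe(hidden, labels, train_frac=0.7, seed=0):
    """hidden has shape (n_layers, n_examples, d). Returns held-out accuracy per layer."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(hidden.shape[1])
    cut = int(train_frac * hidden.shape[1])
    tr, te = perm[:cut], perm[cut:]
    return [probe_accuracy(h[tr], labels[tr], h[te], labels[te]) for h in hidden]


def make_synthetic(n_layers=4, n=400, d=16, signal_layers=(1, 2, 3), seed=0):
    """Hypothetical stand-in for model activations: Gaussian noise plus a
    label-dependent shift along one direction in the chosen layers."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    hidden = rng.normal(size=(n_layers, n, d))
    for layer in signal_layers:
        hidden[layer] += 3.0 * (2 * labels - 1)[:, None] * direction
    return hidden, labels


# "Evidence position": the label is encoded from layer 1 onward.
evidence_hidden, labels = make_synthetic(signal_layers=(1, 2, 3), seed=0)
# "Prediction position": same labels, but no label signal in any layer.
prediction_hidden, _ = make_synthetic(signal_layers=(), seed=0)

evidence_acc = layerwise_probe(evidence_hidden, labels)
prediction_acc = layerwise_probe(prediction_hidden, labels)
print("evidence position:  ", evidence_acc)
print("prediction position:", prediction_acc)
```

In a real analysis, the synthetic arrays would be replaced by hidden states extracted from the model at the chosen token positions. High probe accuracy at the evidence position combined with near-chance accuracy at the prediction position is the pattern the abstract describes for charts.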