The authors introduce the Graded Color Attribution (GCA) dataset to evaluate whether Vision-Language Models (VLMs) and humans adhere to their self-identified decision rules for color labeling. They find that while VLMs can accurately estimate color coverage, they systematically violate their own stated rules for color attribution, particularly when world-knowledge priors are involved. In contrast, humans remain faithful to their rules, with apparent deviations explained by overestimation of color coverage, suggesting a fundamental difference in introspective self-knowledge between VLMs and humans.
VLMs may ace the color-coverage test, but they flunk the "practice what you preach" test, routinely ignoring their own stated reasoning rules in ways that humans don't.
Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and whether models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participants' faithfulness to those rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of an object's pixels that must be a given color for the object to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
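To make the faithfulness criterion concrete, here is a minimal sketch of how an elicited threshold rule can be checked against subsequent labeling decisions. The names (`Trial`, `is_faithful`, `violation_rate`) are illustrative and not taken from the paper; the logic is simply that a decision counts as a violation when it disagrees with what the stated threshold implies for the object's actual color coverage.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One color-attribution trial (illustrative fields, not the paper's schema)."""
    stated_threshold: float   # elicited rule: min % coverage required for the color label
    true_coverage: float      # ground-truth % of the object's pixels in the target color
    labeled_with_color: bool  # did the participant/model apply the color label?

def is_faithful(trial: Trial) -> bool:
    """A decision is faithful if it matches what the stated rule implies
    given the object's actual color coverage."""
    rule_says_label = trial.true_coverage >= trial.stated_threshold
    return rule_says_label == trial.labeled_with_color

def violation_rate(trials: list[Trial]) -> float:
    """Fraction of trials where the decision contradicts the stated rule."""
    if not trials:
        return 0.0
    return sum(not is_faithful(t) for t in trials) / len(trials)

# Example: a stated 50% threshold, but a 30%-red object is still labeled "red".
trials = [
    Trial(stated_threshold=50.0, true_coverage=30.0, labeled_with_color=True),  # violation
    Trial(stated_threshold=50.0, true_coverage=70.0, labeled_with_color=True),  # faithful
]
print(f"violation rate: {violation_rate(trials):.2f}")  # 0.50
```

The same check can instead be run against a participant's own coverage estimate rather than the ground truth, which is how apparent human violations can be attributed to coverage overestimation rather than to breaking the stated rule.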