NUS | Jun 4, 2026 | arXiv:2606.05753

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

XiuYu Zhang, Junfeng Fang, Zhenkai Liang

AI Summary

This study tests a common assumption about auxiliary losses in latent visual reasoning (LVR) for vision-language models (VLMs): that closer alignment between latents and their visual targets yields better answers. Across a designed matrix of five LVR variants, the authors find the opposite, with cosine alignment negatively correlated with accuracy (r=-0.94). To explain this, they introduce PRISM, a pair of inference-time diagnostics consisting of a linear probe and a corruption test. The diagnostics show that the supervised latents are largely bypassed: the answer is decodable downstream of the latent but not at it, and corrupting the latents shifts accuracy by at most four points. The auxiliary objective appears to reshape the language model through shared parameters rather than through the latent it nominally optimizes.

Key Contribution

Cosine alignment is a misleading quality metric for latent visual reasoning: across five LVR variants it correlates negatively with accuracy (r=-0.94), and the supervised latents are largely bypassed when the model produces its answer.

Abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
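The quantities named in the abstract can be made concrete with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' PRISM implementation: the function names (`cosine`, `pearson_r`, `corruption_shift`, `gaussian_corrupt`) and the toy `bypassing_model` are hypothetical, and the toy examples are synthetic rather than taken from the paper. The sketch shows the alignment metric, the correlation statistic that would be computed across variants, and a corruption test that measures how much accuracy moves when the latent is replaced by noise.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (the alignment metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation, e.g. between per-variant alignment and accuracy."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def corruption_shift(predict, examples, corrupt, seed=0):
    """Accuracy with clean latents minus accuracy with corrupted latents.

    `examples` is a list of (input, latent, label) triples. A shift near
    zero suggests the latent is not load-bearing for `predict`.
    """
    rng = random.Random(seed)
    n = len(examples)
    clean = sum(predict(x, z) == y for x, z, y in examples) / n
    noisy = sum(predict(x, corrupt(z, rng)) == y for x, z, y in examples) / n
    return clean - noisy


def gaussian_corrupt(z, rng, scale=1.0):
    """Replace every latent coordinate with Gaussian noise (one possible corruption)."""
    return [rng.gauss(0.0, scale) for _ in z]


def bypassing_model(x, z):
    """Toy stand-in (assumption): answers from the input alone and ignores the latent."""
    return x % 2


# Synthetic (input, latent, label) triples, for illustration only.
examples = [(i, [0.1 * i, 1.0], i % 2) for i in range(10)]
shift = corruption_shift(bypassing_model, examples, gaussian_corrupt)
print("accuracy shift under latent corruption:", shift)
```

Because the toy model never reads its latent, corruption cannot change its predictions. This is the extreme case of the bypass behavior the abstract describes. A real evaluation would replace `bypassing_model` with a forward pass of the VLM in which the latent tokens are swapped out before answer generation.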

Eval Frameworks & Benchmarks, Multimodal Models

Year: 2026

Venue: N/A
