Feb 26, 2026 | arXiv:2602.23353

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

AI Summary

The paper introduces SOTAlign, a semi-supervised framework for aligning pre-trained vision and language models using limited paired data and large amounts of unpaired data. SOTAlign first learns a coarse shared geometry from paired data using a linear teacher, and then refines the alignment on unpaired samples using an optimal-transport-based divergence to transfer relational structure. Experiments demonstrate that SOTAlign effectively leverages unpaired data, learns robust joint embeddings, and outperforms supervised and semi-supervised baselines across datasets and encoder pairs.

Key Contribution

Achieves meaningful vision-language alignment with far fewer image-text pairs than contrastive approaches that rely on millions of paired samples, by leveraging unpaired data via optimal transport.

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
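The two stages described above can be illustrated, very roughly, with a toy example. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the abstract says only that the teacher is linear and that the refinement uses an optimal-transport-based divergence. The ridge-regression teacher, the cosine cost, the entropic Sinkhorn solver, the synthetic embeddings, and every name below (`fit_linear_teacher`, `sinkhorn_plan`, `ridge`, `eps`) are hypothetical choices made for illustration.

```python
import numpy as np

def fit_linear_teacher(X_img, Y_txt, ridge=1e-3):
    """Assumed teacher: ridge-regression map W so that X_img @ W ~ Y_txt."""
    d = X_img.shape[1]
    A = X_img.T @ X_img + ridge * np.eye(d)
    return np.linalg.solve(A, X_img.T @ Y_txt)

def normalize(Z):
    """Scale each row to unit L2 norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals for a given cost matrix."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass on rows (images)
    b = np.full(m, 1.0 / m)   # uniform mass on columns (texts)
    K = np.exp(-cost / eps)   # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Synthetic stand-ins for frozen encoder outputs (assumption: both
# modalities are noisy linear views of a shared latent variable).
rng = np.random.default_rng(0)
d_lat, d_img, d_txt = 4, 8, 6
A_img = rng.normal(size=(d_lat, d_img))
A_txt = rng.normal(size=(d_lat, d_txt))

def embed(Z, A):
    return Z @ A + 0.05 * rng.normal(size=(Z.shape[0], A.shape[1]))

# Stage 1: fit the linear teacher on a small paired set.
Z_pair = rng.normal(size=(50, d_lat))
W = fit_linear_teacher(embed(Z_pair, A_img), embed(Z_pair, A_txt))

# Stage 2 (illustrative): on unpaired images and texts, use the teacher's
# geometry to build a cost matrix and compute a soft matching with OT.
X_unp = embed(rng.normal(size=(30, d_lat)), A_img)
Y_unp = embed(rng.normal(size=(30, d_lat)), A_txt)
cost = 1.0 - normalize(X_unp @ W) @ normalize(Y_unp).T   # cosine cost
plan = sinkhorn_plan(cost)
```

Here `plan` is a soft image-to-text matching over the unpaired batch. A student alignment layer could, in principle, be trained against such a target, but the actual divergence and training loop used by SOTAlign are not given in the abstract and are not reproduced here.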

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
