Mar 3, 2026 | arXiv:2603.02767

ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

AI Summary

The paper introduces ITO, a framework for image-text contrastive pretraining that aims to reduce modality-specific organization in learned representations. ITO employs multimodal multiple alignment to enhance supervision by discovering diverse image-text correspondences and uses a lightweight training-time multimodal fusion module to enforce structured cross-modal interaction. Experiments demonstrate that ITO outperforms strong baselines on classification, retrieval, and multimodal tasks, with analysis showing that multiple alignment improves discriminative power and training-time fusion acts as a structural regularizer.

Key Contribution

Image-text models can achieve superior performance by fusing modalities during training only, then discarding the fusion module at inference for efficiency.

Abstract

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
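To make the "fusion at training time only" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. This listing does not specify the paper's loss functions, the fusion architecture, or how multiple alignment mines correspondences, so everything below is an assumption made for illustration: `toy_fusion` (a simple average of paired embeddings), `training_loss`, and `fusion_weight` are hypothetical names and forms, and multiple alignment is not modeled at all. The sketch only shows the structural point that an auxiliary fusion branch can contribute to the training objective while inference uses nothing but the two encoders' embeddings.

```python
import math

def normalize(v):
    # Scale a vector to unit length (guard against a zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    # Mean cross-entropy in which queries[i] should match keys[i].
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric loss: a-to-b and b-to-a directions averaged.
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def toy_fusion(img, txt):
    # Hypothetical stand-in for a fusion module: average each image-text pair.
    return [normalize([(a + b) / 2 for a, b in zip(i, t)])
            for i, t in zip(img, txt)]

def training_loss(img, txt, fusion_weight=0.5):
    # Assumed objective: alignment loss plus a weighted fusion-branch term.
    fused = toy_fusion(img, txt)
    align = contrastive_loss(img, txt)
    fuse = 0.5 * (contrastive_loss(fused, img) + contrastive_loss(fused, txt))
    return align + fusion_weight * fuse

def retrieve(query, candidates):
    # Inference: dual-encoder similarity search only; no fusion call.
    scores = [dot(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy embeddings standing in for encoder outputs (4 pairs, 4 dimensions).
img = [normalize(v) for v in
       [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]]
txt = [normalize(v) for v in
       [[0.9, 0.1, 0, 0], [0, 0.9, 0.1, 0], [0, 0, 0.9, 0.1], [0.1, 0, 0, 0.9]]]

loss = training_loss(img, txt)           # fusion branch is used here
top = [retrieve(t, img) for t in txt]    # fusion branch is not used here
```

In this sketch `retrieve` never calls `toy_fusion`, so retrieval costs the same as in a standard dual encoder, which is the efficiency property the abstract describes.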

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
