Baidu | Mar 2, 2026 | arXiv:2603.01471

Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

Jiahan Chen, Da Li, Hengran Zhang, Yinqiong Cai, Lixin Su, Jiafeng Guo, Daiting Shi, Dawei Yin, Keping Bi

AI Summary

The paper introduces CoCoA, a content reconstruction pre-training paradigm based on collaborative attention, to improve multimodal embedding quality in MLLMs. CoCoA restructures the attention flow and introduces an EOS-based reconstruction task, forcing the model to compress input semantics into embeddings. Experiments on MMEB-V1 using Qwen2-VL and Qwen2.5-VL backbones demonstrate that CoCoA significantly enhances embedding quality by generating more compact and informative representations.

Key Contribution

Multimodal embeddings get a serious upgrade with CoCoA, a new pre-training method that forces models to compress all input information into a single token for reconstruction, leading to substantial quality gains.

Abstract

Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding EOS embeddings. This drives the multimodal model to compress the semantic information of the input into the EOS token, laying the foundation for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. Results validate that content reconstruction serves as an effective strategy to maximize the value of existing data, enabling multimodal embedding models to generate compact and informative representations and raising their performance ceiling.
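The abstract does not spell out how the attention flow is restructured, so the following is only a minimal sketch of one way an EOS bottleneck could be expressed as an attention mask. The sequence layout (input tokens, then one EOS token, then reconstruction tokens), the function name `eos_bottleneck_mask`, and the masking rule are all assumptions made for illustration, not the paper's actual design. The idea shown is that reconstruction positions can see the EOS position but not the raw input, so any information needed to rebuild the input has to pass through the EOS representation.

```python
def eos_bottleneck_mask(n_input: int, n_recon: int) -> list[list[bool]]:
    """Build a boolean attention mask; mask[i][j] is True if position i may attend to j.

    Hypothetical layout: [input tokens | EOS | reconstruction tokens].
    This is an illustrative assumption, not the masking scheme from the paper.
    """
    eos = n_input                      # index of the single EOS position
    total = n_input + 1 + n_recon
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):         # no position attends to the future
            if i <= eos:
                # Input tokens and EOS use ordinary causal attention.
                mask[i][j] = True
            else:
                # Reconstruction tokens see only EOS and earlier
                # reconstruction tokens, never the raw input.
                mask[i][j] = j >= eos
    return mask


if __name__ == "__main__":
    # Print the mask for 3 input tokens and 2 reconstruction tokens.
    for row in eos_bottleneck_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed layout, the EOS row attends to every input token, while each reconstruction row is blocked from the input columns, which is what makes the EOS position the only path from input to reconstruction.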

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
