Chongqing · Chongqing Ant Consumer Finance Co · Chongqing Key Laboratory of Computational · College of Computer Science and Technology · construction Key Laboratory of Digital · Key Laboratory of Cyberspace Big Data · Ministry of Education · Sichuan-Chongqing Co

Jun 8, 2026 · arXiv:2606.09331

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Shiyu Li, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang

AI Summary

This paper introduces Conan-embedding-v3, a framework for omni-modal retrieval that combines modality-specific models through a decouple-fuse-recover approach. Modality specialists are trained independently and their task vectors are fused into a single backbone, which supports retrieval across text, image, video, document, and audio inputs. Fusion also exposes a failure mode the authors call Projector Drift: the audio projector stays calibrated to the audio-specialist backbone, so audio retrieval regresses sharply after fusion. To repair it, the authors apply Projector Recovery, which fine-tunes the projector while keeping the backbone frozen, followed by balanced multi-modal rehearsal. The resulting model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.

Key Contribution

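The paper identifies Projector Drift, a failure mode in which fusing the backbone leaves the audio projector calibrated to the pre-fusion audio-specialist backbone and sharply degrades audio retrieval, and repairs it by fine-tuning the projector while the backbone stays frozen.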
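To make the drift-and-recovery idea concrete, here is a minimal toy sketch, not the authors' training recipe. It assumes a scalar "backbone" gain and a scalar "projector" gain, and uses a squared-error objective as a stand-in for whatever retrieval loss the paper actually uses. All names (`loss`, `recover_projector`, `audio_backbone`, `fused_backbone`) are hypothetical.

```python
def loss(backbone, projector, data):
    """Mean squared error of the pipeline output backbone * projector * x against y."""
    return sum((backbone * projector * x - y) ** 2 for x, y in data) / len(data)


def recover_projector(backbone, projector, data, lr=0.05, steps=200):
    """Gradient descent on the projector only; the backbone value is never updated."""
    for _ in range(steps):
        grad = sum(
            2 * (backbone * projector * x - y) * backbone * x for x, y in data
        ) / len(data)
        projector -= lr * grad
    return projector


# Toy (input, target) pairs; the ideal pipeline is the identity map.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

audio_backbone = 2.0   # assumed backbone the projector was originally trained against
fused_backbone = 1.25  # assumed backbone after fusion with other specialists
projector = 0.5        # calibrated so that audio_backbone * projector == 1

calibrated_loss = loss(audio_backbone, projector, data)  # projector matches its backbone
drift_loss = loss(fused_backbone, projector, data)       # same projector, new backbone

recovered = recover_projector(fused_backbone, projector, data)
recovered_loss = loss(fused_backbone, recovered, data)

print(calibrated_loss, drift_loss, recovered_loss)
```

In this toy setting the copied projector is unchanged, yet the error rises as soon as the backbone underneath it changes. Re-fitting only the projector against the frozen fused backbone brings the error back down.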

Abstract

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult because these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression even though all audio-specific modules are copied unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports all of these retrieval pathways in one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
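The abstract does not state the exact merging rule, so the following is only a sketch of the general task-vector idea. It assumes plain task arithmetic: each specialist's task vector is its weights minus the shared base weights, and the fused backbone is the base plus a weighted sum of those vectors. The function names, the parameter name `layer.w`, and the weights of 0.5 are illustrative assumptions, not values from the paper.

```python
def task_vector(specialist, base):
    """Per-parameter difference between a specialist and the shared base weights."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base, specialists, weights):
    """Return base weights plus a weighted sum of the specialists' task vectors."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist, weight in zip(specialists, weights):
        tau = task_vector(specialist, base)
        for name, delta in tau.items():
            fused[name] = [f + weight * d for f, d in zip(fused[name], delta)]
    return fused


# Tiny stand-in "checkpoints": one parameter tensor flattened to a list.
base = {"layer.w": [0.0, 1.0, 2.0]}
image_specialist = {"layer.w": [1.0, 1.0, 2.0]}  # hypothetical image-tuned weights
video_specialist = {"layer.w": [0.0, 3.0, 2.0]}  # hypothetical video-tuned weights

fused = fuse(base, [image_specialist, video_specialist], weights=[0.5, 0.5])
print(fused)
```

Because the fused weights differ from every individual specialist, any module trained against one specialist's backbone (such as an audio projector) now sees a different backbone, which is the setting in which the abstract describes Projector Drift.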

Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
