Tsinghua AI · UMacau · Mar 5, 2026 · arXiv:2603.04772

TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

Yebo Wu, Fenglin Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, Li Li

AI Summary

The paper introduces TSEmbed, a universal multimodal embedding framework that combines a Mixture-of-Experts (MoE) architecture with LoRA to disentangle conflicting task objectives when adapting Multimodal Large Language Models (MLLMs) for universal embedding tasks. The authors further propose Expert-Aware Negative Sampling (EANS), which uses expert routing distributions to identify informative hard negatives, improving the model's discriminative power. TSEmbed achieves state-of-the-art performance on MMEB and on industrial datasets, demonstrating its effectiveness for task-level scaling in multimodal embeddings.

Key Contribution

TSEmbed reaches state-of-the-art multimodal embedding performance without task-specific fine-tuning by disentangling task objectives with a Mixture-of-Experts and a novel expert-aware negative sampling strategy.

Abstract

Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
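The abstract gives no formulas, so the following NumPy code is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes a dense softmax router over LoRA experts attached to a single frozen linear layer, and it assumes that "sharing expert activation patterns" can be approximated by the cosine similarity between routing distributions, used to up-weight in-batch negatives in an InfoNCE loss. All function names, the `beta` and `temperature` parameters, and the weighting rule itself are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_lora_forward(x, W, A, B, W_gate):
    """Frozen linear layer plus a router-weighted mixture of LoRA experts.

    x: (n, d_in) inputs, W: (d_in, d_out) frozen base weight,
    A: (E, d_in, r) and B: (E, r, d_out) per-expert low-rank factors,
    W_gate: (d_in, E) router weights (assumed dense softmax routing).
    Returns the output (n, d_out) and the routing distribution (n, E).
    """
    routing = softmax(x @ W_gate)                         # (n, E)
    low_rank = np.einsum("nd,edr->ner", x, A)             # (n, E, r)
    expert_out = np.einsum("ner,ero->neo", low_rank, B)   # (n, E, d_out)
    delta = np.einsum("ne,neo->no", routing, expert_out)  # mix experts
    return x @ W + delta, routing

def eans_weights(q_routing, c_routing, beta=1.0):
    """Assumed weighting rule: negatives whose routing distribution is
    closer (in cosine similarity) to the query's get a larger weight."""
    qn = q_routing / np.linalg.norm(q_routing, axis=1, keepdims=True)
    cn = c_routing / np.linalg.norm(c_routing, axis=1, keepdims=True)
    return np.exp(beta * (qn @ cn.T))                     # (n, m), each >= 1

def weighted_info_nce(q_emb, c_emb, neg_weights, temperature=0.05):
    """InfoNCE with in-batch negatives; candidate i is the positive for
    query i, and negatives are reweighted inside the denominator."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    logits = (q @ c.T) / temperature                      # (n, n)
    w = np.array(neg_weights, dtype=float)
    np.fill_diagonal(w, 1.0)                              # positives keep weight 1
    m = logits.max(axis=1, keepdims=True)                 # stabilise the log-sum-exp
    log_denom = np.log((w * np.exp(logits - m)).sum(axis=1)) + m[:, 0]
    return float(np.mean(log_denom - np.diag(logits)))
```

In this sketch, setting every weight to 1 recovers the standard InfoNCE loss, so the routing-based weights act purely as a reweighting of which negatives dominate the gradient. The two-stage paradigm described in the abstract would correspond, under these assumptions, to training first with uniform weights and switching to `eans_weights` only once the router has stabilised.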

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 54

Year: 2026

Venue: N/A
