This paper investigates whether training on multi-way parallel corpora improves cross-lingual alignment in multilingual models. The authors construct a multi-way parallel dataset by translating English text into six target languages with an off-the-shelf NMT model, then apply contrastive learning to align the representations. On MTEB tasks, XLM-RoBERTa and multilingual BERT base models improve substantially, particularly in bitext mining, semantic similarity, and classification, which supports the value of multi-way parallel data for cross-lingual representation learning.
Multi-way parallel data improves cross-lingual alignment in multilingual models, outperforming English-centric bilingual data by up to 28.4% (classification) on MTEB tasks.
Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark, evaluated with XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to fine-tuning on one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
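To make the contrast between multi-way and English-centric supervision concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the loss, so the sketch assumes a standard InfoNCE objective with in-batch negatives, cosine similarity, and a temperature of 0.05. The function names (`info_nce`, `multiway_loss`, `english_centric_pairs`) and the batch layout (a dict mapping a language code to a list of sentence embeddings, where index `i` is the same sentence in every language) are hypothetical, chosen only for illustration.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, candidates, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    anchors[i] and candidates[i] are translations of the same sentence;
    every candidates[j] with j != i serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def multiway_loss(batch, temperature=0.05):
    """Average InfoNCE over every ordered pair of distinct languages."""
    pairs = list(permutations(batch, 2))
    return sum(info_nce(batch[a], batch[b], temperature) for a, b in pairs) / len(pairs)


def english_centric_pairs(languages, pivot="en"):
    """The En-X pairs available when only bilingual data with English exists."""
    return [(pivot, lang) for lang in languages if lang != pivot]
```

Under these assumptions, a pool of English plus six target languages gives 42 ordered language pairs (21 unordered) in the multi-way setting, against only 6 En-X pairs in the English-centric one, so each translated sentence contributes alignment signal between non-English languages as well.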