The paper investigates the impact of parallel data on cross-lingual alignment in multilingual pretraining. By training models with varying amounts of parallel data, the authors find that parallel data has a surprisingly limited effect on the cross-lingual alignment the model ultimately achieves. Its primary benefits appear to be accelerating alignment early in training and reducing the number of language-specific neurons, but comparable alignment emerges even without parallel data.
Forget massive parallel datasets: cross-lingual alignment in multilingual models emerges almost as effectively without them.
Shared multilingual representations are essential for cross-lingual tasks and for knowledge transfer across languages. This study examines the impact of parallel data, i.e. translated sentences, as a pretraining signal for triggering representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seems to have only a minimal effect on the cross-lingual alignment of the final model. Based on multiple evaluation methods, we find that its effect is limited to potentially accelerating representation sharing in the early phases of pretraining and to decreasing the number of language-specific neurons in the model. Cross-lingual alignment emerges at similar levels even without the explicit signal from parallel data.
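As a rough illustration of one common way cross-lingual alignment is probed (not necessarily the exact evaluation used in this paper), the sketch below compares mean-pooled hidden states of a sentence and its translation at each layer of a multilingual encoder; higher cosine similarity suggests more shared representations. The model name and the sentence pair are placeholder assumptions.

```python
# Hedged sketch: probe cross-lingual alignment by comparing hidden states of a
# sentence and its translation. Illustrative only; model and sentences are
# placeholders, not the paper's actual setup.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # any multilingual encoder; placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def sentence_embedding(text: str, layer: int) -> torch.Tensor:
    """Mean-pool the hidden states of a given layer over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical parallel sentence pair (English / German).
en = "The cat sleeps on the sofa."
de = "Die Katze schläft auf dem Sofa."

# hidden_states has num_hidden_layers + 1 entries (embeddings + each layer).
for layer in range(model.config.num_hidden_layers + 1):
    sim = torch.cosine_similarity(
        sentence_embedding(en, layer), sentence_embedding(de, layer)
    ).item()
    print(f"layer {layer:2d}: cosine similarity = {sim:.3f}")
```

Averaging such similarities over many parallel pairs, and tracking them across pretraining checkpoints, is one way the early-training acceleration described above could be observed.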