Mar 2, 2026 · arXiv:2603.01732

Bootstrapping Embeddings for Low Resource Languages

Merve Basoz, Andrew Horne, Mattia Opper

AI Summary

The paper explores methods for generating synthetic triplet data to train embedding models for low-resource languages, which lack supervised finetuning data. The authors test in-context learning alongside two novel approaches: adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA). In-context learning falls short of strong non-synthetic baselines, while adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a scalable route to embedding models in low-resource settings.

Key Contribution

Adapter composition and XL-LoRA generate synthetic training triplets that yield strong embedding models for low-resource languages, where supervised finetuning data is unavailable.

Abstract

Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high-resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
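
The abstract does not state which objective is applied to the triplets. As an illustration only, the sketch below shows one common way an (anchor, positive, negative) triplet can drive embedding training: a margin-based hinge loss on cosine distance. The function names, the margin value, and the toy vectors are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on cosine distance (margin is an assumed hyperparameter).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-d embeddings (hypothetical values, not model outputs).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]   # same direction as the anchor
negative = [0.0, 1.0]   # orthogonal to the anchor

# Well-separated triplet: no penalty.
easy_loss = triplet_margin_loss(anchor, positive, negative)

# Swapping positive and negative simulates a badly ordered triplet.
hard_loss = triplet_margin_loss(anchor, negative, positive)
```

In a real pipeline the vectors would come from the embedding model being finetuned, and the loss would be minimised over many synthetic triplets rather than evaluated on a single hand-written one.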

Data Curation & Synthetic Data; Natural Language Processing; Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
