CMU ML | North-West University | Mar 29, 2026 | arXiv:2603.27651

Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages

Tewodros Kederalah Idris, Roald Eiselen, Prasenjit Mitra

AI Summary

The paper introduces Budget-Xfer, a framework for optimizing source language selection in cross-lingual transfer learning under a fixed annotation budget. It formulates multi-source transfer as a resource allocation problem, jointly optimizing which source languages to include and how much data to allocate from each. Experiments on NER and sentiment analysis for three African languages reveal that multi-source transfer significantly outperforms single-source transfer, and that embedding similarity is not always a reliable proxy for source language selection.
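The allocation idea in the summary can be illustrated with a minimal sketch. This is not the paper's method: the proportional rule, the function name `allocate_budget`, the source names, the weights, and the pool sizes are all hypothetical choices made for illustration. It splits a fixed budget B across candidate sources in proportion to a per-source weight, caps each share at the data that source actually has, and hands any leftover to the highest-weighted sources with spare capacity.

```python
from typing import Dict


def allocate_budget(
    available: Dict[str, int],
    weights: Dict[str, float],
    budget: int,
) -> Dict[str, int]:
    """Split `budget` examples across sources (illustrative rule, not the paper's).

    available: labeled examples each source can supply (hypothetical values).
    weights:   non-negative preference per source, e.g. a similarity score.
    """
    total_w = sum(weights[lang] for lang in available)
    if total_w <= 0:
        raise ValueError("at least one source needs a positive weight")

    # Pass 1: proportional share (rounded down), capped by what the source has.
    alloc = {
        lang: min(available[lang], int(budget * weights[lang] / total_w))
        for lang in available
    }

    # Pass 2: give unspent budget to the highest-weighted sources with room left.
    leftover = budget - sum(alloc.values())
    for lang in sorted(available, key=lambda l: weights[l], reverse=True):
        extra = min(leftover, available[lang] - alloc[lang])
        alloc[lang] += extra
        leftover -= extra

    return alloc


if __name__ == "__main__":
    # Hypothetical pool sizes and weights; B = 1000 examples.
    pools = {"src_a": 300, "src_b": 5000, "src_c": 5000}
    prefs = {"src_a": 2, "src_b": 1, "src_c": 1}
    print(allocate_budget(pools, prefs, budget=1000))
```

Under these made-up inputs, `src_a` is capped at its 300 available examples and the unspent part of its share moves to `src_b`, so the full budget is still used. If the combined pools were smaller than B, the returned allocation would simply sum to less than B.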

Key Contribution

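Rather than hand-picking cross-lingual training data, Budget-Xfer treats source language selection as a budget-constrained allocation across multiple source languages. Under a fixed annotation budget, multi-source transfer outperforms single-source transfer on three African target languages with large effect sizes (Cohen's d = 0.80 to 1.98), while differences among the multi-source strategies themselves are modest and non-significant.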

Abstract

Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
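For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of Cohen's d for two independent groups using the pooled standard deviation. The abstract does not say which variant the authors computed (pooled, paired, or otherwise), so the pooled form here is an assumption, and the function name and sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)

    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical scores, not results from the paper.
    multi_source = [2.0, 4.0, 6.0]
    single_source = [1.0, 3.0, 5.0]
    print(cohens_d(multi_source, single_source))
```

By Cohen's conventional thresholds, d near 0.2 is read as small, 0.5 as medium, and 0.8 as large, so the reported range of 0.80 to 1.98 sits at or above the "large" mark.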

Data Curation & Synthetic Data | Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
