Search papers, labs, and topics across Lattice.
The paper introduces Budget-Xfer, a framework for optimizing source language selection in cross-lingual transfer learning under a fixed annotation budget. It formulates multi-source transfer as a resource allocation problem, jointly optimizing which source languages to include and how much data to allocate from each. Experiments on NER and sentiment analysis for three African languages reveal that multi-source transfer significantly outperforms single-source transfer, and that embedding similarity is not always a reliable proxy for source language selection.
Forget hand-picking your cross-lingual training data: a budget-constrained optimization can automatically allocate resources across multiple source languages, boosting performance on African languages by a large margin.
Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.