The Harker School | Mar 31, 2026 | arXiv:2603.29552

Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng, Steven Y. Feng, Michael C. Frank

AI Summary

The authors trained GPT-2 models on matched 100M-word mono- and bilingual datasets to simulate language acquisition under controlled exposure conditions. They investigated the impact of different bilingual exposure regimes on perplexity, grammaticality, and semantic knowledge. Results indicate that bilingual models perform comparably to monolingual models in one language while also demonstrating strong performance in the second, suggesting no inherent challenges for statistical learners in bilingual environments.

Key Contribution

Bilingual language models can match monolingual models in one language while also performing strongly in the second, challenging the assumption that bilingual input poses significant learning obstacles.

Abstract

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.
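Perplexity, one of the evaluation measures named in the abstract, is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below illustrates that standard definition only. The paper's actual evaluation pipeline is not described on this page, so the function name `perplexity` and the sample log-probabilities are assumptions made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return perplexity given the natural-log probability a model
    assigned to each token of a held-out text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values, not results from the paper: a model that spreads
# probability uniformly over four choices at every one of eight positions.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))
```

Lower values mean the model found the text more predictable. Comparing a bilingual and a monolingual model this way is only meaningful when both are scored on the same held-out text with the same tokenization.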

Data Curation & Synthetic Data | Natural Language Processing | Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
