CMU ML, Brno University of Technology, JHU, Kyoto, Sheffield, UT Austin
Jun 9, 2026
arXiv:2606.11514

CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alexander Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe

AI Summary

This paper introduces CS-YODAS, a 313-hour dataset of naturally occurring code-switched speech mined from multilingual YouTube videos, addressing the scarcity of diverse and authentic resources in this area. The dataset is built with a scalable, human-in-the-loop pipeline that identifies and validates instances of code-switching, providing a resource for studying spontaneous language alternation. The authors analyze language-pair frequencies and switching patterns in the dataset and report baseline results for spoken language identification in code-switched contexts.

Key Contribution

A 313-hour, Creative Commons-licensed dataset of real-world code-switched speech spanning 7 matrix languages, accompanied by an analysis of language-pair frequencies and switching patterns and by baseline results for spoken language identification.

Abstract

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
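The kind of analysis the abstract mentions (language-pair frequencies and switching patterns) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy utterances, the per-token language tags, and the helper names are hypothetical and are not drawn from CS-YODAS or from the authors' pipeline. In particular, the matrix language is approximated here by token majority, which is only a crude proxy.

```python
from collections import Counter
from itertools import groupby

# Hypothetical toy annotations: each utterance is a list of (token, language) pairs.
# Illustrative only; not taken from CS-YODAS.
utterances = [
    [("I", "en"), ("want", "en"), ("chai", "hi"), ("now", "en")],
    [("vamos", "es"), ("to", "en"), ("the", "en"), ("store", "en")],
    [("main", "hi"), ("office", "en"), ("ja", "hi"), ("raha", "hi"), ("hoon", "hi")],
]

def language_runs(utterance):
    """Collapse consecutive same-language tokens into runs (en en hi en -> en, hi, en)."""
    return [lang for lang, _ in groupby(lang for _, lang in utterance)]

def switch_count(utterance):
    """Number of switch points: one fewer than the number of language runs."""
    return max(len(language_runs(utterance)) - 1, 0)

def matrix_language(utterance):
    """Assumption: approximate the matrix language by the most frequent token language."""
    counts = Counter(lang for _, lang in utterance)
    return counts.most_common(1)[0][0]

def language_pair(utterance):
    """Order-independent tuple of the languages present in the utterance."""
    return tuple(sorted({lang for _, lang in utterance}))

pair_counts = Counter(language_pair(u) for u in utterances)
switches = [switch_count(u) for u in utterances]
matrices = [matrix_language(u) for u in utterances]

if __name__ == "__main__":
    for pair, n in pair_counts.most_common():
        print("-".join(pair), n)
    print("switches per utterance:", switches)
    print("matrix language per utterance:", matrices)
```

Sorting the languages within each pair makes the count independent of which language comes first in an utterance; a directional analysis (which language is switched into) would keep the run order instead.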

Data Curation & Synthetic Data, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
