This paper introduces CS-YODAS, a dataset of 313 hours of naturally occurring code-switched speech mined from multilingual YouTube videos, addressing the scarcity of diverse, authentic resources for code-switching research. The dataset is built with a scalable, human-in-the-loop pipeline that identifies and validates instances of code-switching. The authors analyze language-pair frequencies and switching patterns in the data and report baseline results for spoken language identification in code-switched speech.
A 313-hour dataset of real-world code-switched speech, with an analysis of language-pair frequencies and switching patterns.
We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
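As a rough illustration of what counting language-pair frequencies and switching patterns can look like, here is a minimal Python sketch. It assumes each utterance is already available as a list of per-word language codes. That input format, the example utterances, and the function names `switch_points` and `language_pair_counts` are hypothetical and are not taken from the CS-YODAS release or its pipeline.

```python
from collections import Counter
from typing import List, Tuple


def switch_points(labels: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, from_lang, to_lang) for every position where the
    language label differs from the previous word's label."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(labels, labels[1:]), start=1)
        if prev != cur
    ]


def language_pair_counts(utterances: List[List[str]]) -> Counter:
    """Count how many utterances contain each combination of languages.
    Monolingual utterances are skipped; language order is ignored."""
    return Counter(
        tuple(sorted(set(labels)))
        for labels in utterances
        if len(set(labels)) > 1
    )


# Hypothetical per-word language labels (assumed format, not real data).
utterances = [
    ["en", "en", "es", "es", "en"],
    ["hi", "en", "hi"],
    ["es", "en"],
    ["en", "en"],  # monolingual, ignored by language_pair_counts
]

for labels in utterances:
    print(labels, "->", switch_points(labels))

print(language_pair_counts(utterances))
```

Treating a language pair as an unordered set keeps the count independent of which language appears first. Identifying the matrix language of an utterance would need additional information that this sketch does not attempt to model.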