The authors introduce Multilingual TinyStories, a synthetically generated dataset of children's stories in 17 Indic languages, designed for training and evaluating small language models (SLMs). Their hybrid curation pipeline combines the Sarvam-M language model with combinatorial prompt engineering for native generation, and uses the Google Translate API for cross-lingual expansion. The resulting corpus contains 132,942 stories and over 93.9 million tokens, offering a resource for multilingual language modeling and transfer learning.
Training SLMs for low-resource Indic languages just got easier: a new synthetic dataset of children's stories offers a large, localized, and simple corpus.
The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
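The abstract names two mechanisms without detailing them here: combinatorial prompt construction and programmatic filtering to native scripts. The sketch below illustrates one plausible reading of each, and is not the authors' implementation. The ingredient lists, the prompt wording, the function names, and the 0.95 threshold are all hypothetical; only the Devanagari Unicode range (U+0900 to U+097F) is a fixed fact, and it covers just one of the scripts used by the 17 languages.

```python
import itertools
import unicodedata

# Hypothetical prompt ingredients; the paper's actual lists are not shown here.
CHARACTERS = ["a curious rabbit", "a kind farmer"]
SETTINGS = ["a village by the river", "a mango orchard"]
MORALS = ["honesty", "sharing"]


def build_prompts(language: str) -> list[str]:
    """Build one prompt per (character, setting, moral) combination."""
    return [
        f"Write a short, simple children's story in {language} about "
        f"{character} in {setting}. The story should teach {moral}. "
        f"Use only the native script."
        for character, setting, moral in itertools.product(
            CHARACTERS, SETTINGS, MORALS
        )
    ]


# Unicode block for Devanagari (used by Hindi, Marathi, and others).
DEVANAGARI = (0x0900, 0x097F)


def native_script_ratio(text: str, block: tuple[int, int] = DEVANAGARI) -> float:
    """Fraction of letters and combining marks that fall inside the block."""
    # Vowel signs are combining marks (category M*), not letters,
    # so they are counted explicitly alongside alphabetic characters.
    relevant = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not relevant:
        return 0.0
    low, high = block
    inside = sum(low <= ord(ch) <= high for ch in relevant)
    return inside / len(relevant)


def keep_story(text: str, threshold: float = 0.95) -> bool:
    """Keep a story only if it is almost entirely in the native script.

    The threshold is an assumed value chosen for illustration.
    """
    return native_script_ratio(text) >= threshold
```

With two options per slot, `build_prompts` yields 2 × 2 × 2 = 8 prompts per language, which shows how a few short lists can produce a large and varied prompt set. A filter like `keep_story` would reject code-mixed output such as a Hindi sentence containing Latin-script words, while ignoring digits, punctuation, and whitespace.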