Mar 15, 2026 · arXiv:2603.14563

Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

AI Summary

The authors introduce Multilingual TinyStories, a synthetically generated dataset of children's stories in 17 Indic languages, designed for training and evaluating small language models (SLMs). They use a hybrid curation pipeline that combines the Sarvam-M language model and combinatorial prompt engineering for native generation with the Google Translate API for cross-lingual expansion. The resulting corpus contains 132,942 stories and over 93.9 million tokens, offering a resource for multilingual language modeling and transfer learning.

Key Contribution

Training SLMs for low-resource Indic languages just got easier: a new synthetic dataset of children's stories offers a large, localized, and simple corpus.

Abstract

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
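The abstract names two mechanisms without giving their details: combinatorial prompt construction and programmatic filtering to native scripts. The following is a minimal sketch of how such steps might look, not the authors' implementation. The slot lists (`CHARACTERS`, `SETTINGS`, `MORALS`), the prompt wording, the function names, and the 0.95 threshold are all assumptions made for illustration.

```python
import itertools
import unicodedata

# Hypothetical prompt slots; the paper's actual feature lists are not given here.
CHARACTERS = ["a curious girl", "a small elephant"]
SETTINGS = ["a village fair", "a river bank"]
MORALS = ["sharing", "honesty"]


def build_prompts(language):
    """Yield one prompt per (character, setting, moral) combination."""
    for character, setting, moral in itertools.product(CHARACTERS, SETTINGS, MORALS):
        yield (
            f"Write a short children's story in {language} about {character} "
            f"at {setting} that teaches {moral}. Use only the native script."
        )


def is_native_script(text, script_prefix, threshold=0.95):
    """Return True if at least `threshold` of the letters belong to the script.

    A letter's script is inferred from its Unicode character name, e.g.
    'DEVANAGARI LETTER KA' starts with 'DEVANAGARI'. Combining marks,
    digits, punctuation, and whitespace are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    native = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return native / len(letters) >= threshold
```

With two values per slot, `build_prompts("Hindi")` yields 2 × 2 × 2 = 8 distinct prompts, which is how a small set of slot values can expand into a large and varied prompt pool. A filter such as `is_native_script(story, "DEVANAGARI")` would then reject stories that are mostly Latin-script or mixed-script output.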

Data Curation & Synthetic Data · Natural Language Processing · Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
