Feb 16, 2026 · arXiv:2602.14819

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi, Rossella Varvara, Viviana Patti

AI Summary

The paper introduces Testimole-conversational, a 30-billion-word corpus of Italian discussion board messages spanning from 1996 to 2024. This large-scale dataset is designed to facilitate the pre-training of Italian Large Language Models and enable sociolinguistic research on computer-mediated communication. The corpus captures a wide range of informal written Italian and online social interactions over a significant time period, making it a valuable resource for NLP and social science research.

Key Contribution

A massive 30B-word Italian discussion board corpus is now available, unlocking new possibilities for training native Italian LLMs and studying online social dynamics.

Abstract

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

Data Curation & Synthetic Data · Natural Language Processing · Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
