Mar 16, 2026 arXiv:2603.15326

Tagarela - A Portuguese speech dataset from podcasts

Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

AI Summary

The authors introduce TAGARELA, a new Portuguese speech dataset comprising 8,972 hours of podcast audio, designed for training ASR and TTS models. The dataset was created using a mixed strategy of audio pre-processing and ASR-based transcription, leveraging models pre-trained on high-fidelity proprietary API transcriptions. Experiments training ASR and TTS models solely on TAGARELA demonstrate its potential for advancing speech technologies in Portuguese.

Key Contribution

Rivaling English's GigaSpeech in scale, TAGARELA unlocks the potential for state-of-the-art Portuguese speech models with its nearly 9,000 hours of podcast audio.

Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10,000 hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at https://freds0.github.io/TAGARELA/, to foster the development of robust speech technologies.

Data Curation & Synthetic Data, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
