The paper introduces A Bolu, the first structured corpus of Sardinian extemporaneous poetry (cantada logudorese), comprising 2,835 stanzas and 141,321 tokens. The authors perform a multidimensional analysis that combines descriptive statistics and computational linguistics techniques to characterize the poetic text. The analysis reveals recurring patterns in the poetry, supporting Parry and Lord's theory of formulaicity in oral traditions.
A Bolu documents the structure of Sardinian improvised poetry, revealing recurring patterns that inform our understanding of oral creativity and offering a new dataset for NLP research on minority languages.
The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry, a performative genre based on real-time improvisation and metrical-rhetorical competence, remains a largely unexplored area of computational linguistics. This methodological gap calls for dedicated resources to document and analyse the structures of improvised poetry. A Bolu was created in this context: it is the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also contributes to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.
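To make the kind of analysis described above more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes stanzas are available as plain strings, uses a naive whitespace tokenizer, and treats word n-grams that recur across stanzas as a rough proxy for formulaic expressions. The function names (`tokenize`, `type_token_ratio`, `recurring_ngrams`) and the English placeholder stanzas are hypothetical and are not taken from the A Bolu corpus.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(stanza):
    # Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase.
    stripped = (w.strip(PUNCT).lower() for w in stanza.split())
    return [w for w in stripped if w]

def type_token_ratio(tokens):
    # Descriptive index: distinct word forms divided by total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def recurring_ngrams(stanzas, n=3, min_stanzas=2):
    # Count in how many distinct stanzas each word n-gram occurs and keep
    # those found in at least `min_stanzas` stanzas.
    stanza_freq = Counter()
    for stanza in stanzas:
        tokens = tokenize(stanza)
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        stanza_freq.update(seen)
    return {gram: c for gram, c in stanza_freq.items() if c >= min_stanzas}

# Placeholder English stanzas for illustration only (not corpus data).
stanzas = [
    "the poet sings tonight in the square",
    "in the square the crowd replies",
    "tonight the poet sings again",
]

all_tokens = [t for s in stanzas for t in tokenize(s)]
print(len(all_tokens), type_token_ratio(all_tokens))
print(recurring_ngrams(stanzas, n=3))
```

Counting each n-gram once per stanza, rather than once per occurrence, keeps a single repetitive stanza from inflating the score. A real study of formulaicity would likely also need orthographic normalisation and metrical position information, which this sketch omits.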