Chalmers, CIFAR, Imperial, INRIA, McGill, Penn State, Princeton, Trento, UMN

Jun 10, 2026

arXiv:2606.12708

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

AI Summary

This paper introduces AfriSUD, the first extensive collection of syntactically annotated treebanks for nine African languages, addressing the significant underrepresentation of these languages in NLP research. By employing the Surface-Syntactic Universal Dependencies framework, the authors provide high-quality, verified data that highlight unique typological features such as agglutination and tone. Evaluation of various models, including non-transformer baselines and LLMs, reveals a substantial syntax gap, indicating that current architectures struggle to accommodate the structural diversity of African languages.

Key Contribution

Models evaluated on AfriSUD show a significant syntax gap, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

Abstract

Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and resources. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
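As a rough illustration of what evaluating part-of-speech tagging and dependency parsing on a treebank involves, the sketch below scores a predicted parse against a gold one. It assumes the data are in the tab-separated CoNLL-U layout commonly used for SUD/UD treebanks and that the metrics are tagging accuracy plus unlabeled and labeled attachment scores (UAS/LAS); the abstract does not state which metrics the authors report, so this is an assumption. The function names `read_conllu` and `score` and the English toy sentence are hypothetical and are not taken from AfriSUD.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences; each token becomes (upos, head, deprel)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def score(gold_text, pred_text):
    """Return token-level UPOS accuracy, UAS and LAS as fractions in [0, 1]."""
    gold = [tok for sent in read_conllu(gold_text) for tok in sent]
    pred = [tok for sent in read_conllu(pred_text) for tok in sent]
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted files must share a non-empty tokenisation")
    pairs = list(zip(gold, pred))
    n = len(pairs)
    return {
        "upos": sum(g[0] == p[0] for g, p in pairs) / n,
        # UAS: the predicted head matches the gold head.
        "uas": sum(g[1] == p[1] for g, p in pairs) / n,
        # LAS: both head and dependency label match.
        "las": sum(g[1] == p[1] and g[2] == p[2] for g, p in pairs) / n,
    }


def rows_to_conllu(rows):
    """Join (id, form, lemma, upos, head, deprel) rows into 10-column CoNLL-U lines."""
    return "\n".join(
        "\t".join([i, form, lemma, upos, "_", "_", head, rel, "_", "_"])
        for i, form, lemma, upos, head, rel in rows
    ) + "\n"


# Hypothetical toy sentence with SUD-style labels (not AfriSUD data).
GOLD = rows_to_conllu([
    ("1", "she", "she", "PRON", "2", "subj"),
    ("2", "reads", "read", "VERB", "0", "root"),
    ("3", "books", "book", "NOUN", "2", "comp:obj"),
])
# Prediction with one tagging error (token 1) and one label error (token 3).
PRED = rows_to_conllu([
    ("1", "she", "she", "NOUN", "2", "subj"),
    ("2", "reads", "read", "VERB", "0", "root"),
    ("3", "books", "book", "NOUN", "2", "mod"),
])

result = score(GOLD, PRED)
print(result)
```

In this toy case every head is correct, so UAS is 1.0, while the single tag error and the single label error each bring UPOS accuracy and LAS down to 2/3. Official shared-task scorers additionally align differing tokenisations, which this sketch deliberately does not attempt.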

Data Curation & Synthetic Data, Natural Language Processing

