This paper introduces AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine African languages, addressing the underrepresentation of these languages in NLP research. Using the Surface-Syntactic Universal Dependencies (SUD) framework, the authors provide high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. Evaluating non-transformer baselines, multilingual pretrained encoders, and LLMs on part-of-speech tagging and dependency parsing reveals a substantial syntax gap, suggesting that current architectures may not fully capture the structural diversity of African languages.
Models evaluated on AfriSUD show a clear syntax gap, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and in the resources that support it. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.