The paper introduces L-ReLF, a reproducible framework for creating structured lexical datasets for low-resource languages, addressing the challenge of inconsistent terminology in languages like Moroccan Darija. It details a pipeline that handles source identification, OCR (despite its bias toward Modern Standard Arabic), and rigorous post-processing to produce Wikidata Lexeme-compatible datasets. The framework's generalizability provides a pathway for other language communities to build foundational lexical data for NLP applications.
Unlock knowledge equity for underserved languages: L-ReLF offers a reproducible recipe for creating high-quality lexical datasets where they're needed most.
This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity on platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to coin new terms in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, Optical Character Recognition (OCR) despite its bias toward Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as machine translation and morphological analysis.
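To make "Wikidata Lexeme-compatible" concrete, the sketch below builds a record in the shape of the public Wikidata Lexeme JSON data model (lemmas keyed by language code, a lexical category Q-item, a language Q-item, plus forms and senses). This is an illustrative assumption, not the paper's actual pipeline code; the `make_lexeme` helper and the placeholder Q-id for the language item are hypothetical ("ary" is the language code used for Moroccan Arabic, and Q1084 is Wikidata's item for "noun").

```python
import json

def make_lexeme(lemma, lang_code, lexical_category_qid, language_qid,
                forms=None, senses=None):
    """Build a dict following the Wikidata Lexeme JSON data model.

    Hypothetical helper for illustration; a real pipeline would also
    attach forms (with grammatical features) and senses (with glosses).
    """
    return {
        "type": "lexeme",
        # Lemmas are keyed by language code, e.g. "ary" for Moroccan Arabic.
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "lexicalCategory": lexical_category_qid,  # e.g. Q1084 = noun
        "language": language_qid,  # Q-item for the language (placeholder below)
        "forms": forms or [],
        "senses": senses or [],
    }

# Hypothetical Darija entry; the language Q-id is a placeholder to be
# looked up on Wikidata, not a real identifier.
entry = make_lexeme("طوموبيل", "ary", "Q1084", "Q-MOROCCAN-ARABIC")
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

A record in this shape can be validated locally and then submitted through Wikibase's editing API, which is what makes the dataset directly reusable by the Wikidata community.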