The paper introduces L-ReLF, a reproducible framework for creating structured lexical datasets for low-resource languages, addressing the challenge of inconsistent terminology in languages like Moroccan Darija. It details a pipeline that handles source identification, OCR (despite its bias toward Modern Standard Arabic), and rigorous post-processing to produce Wikidata Lexeme-compatible datasets. The framework's generalizability provides a pathway for other language communities to build foundational lexical data for NLP applications.
Unlock knowledge equity for underserved languages: L-ReLF offers a reproducible recipe for creating high-quality lexical datasets where they're needed most.
This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity on platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to coin new terms in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, Optical Character Recognition (OCR) despite its bias toward Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as machine translation and morphological analysis.
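To make "Wikidata Lexeme-compatible" concrete, the sketch below builds a record in the shape of the public Wikidata Lexeme JSON data model (lemmas keyed by language code, a lexical category Q-item, a language Q-item, plus forms and senses). This is an illustrative assumption, not the paper's actual pipeline code; the `make_lexeme` helper and the placeholder Q-id for the language item are hypothetical ("ary" is the language code used for Moroccan Arabic, and Q1084 is Wikidata's item for "noun").

```python
import json

def make_lexeme(lemma, lang_code, lexical_category_qid, language_qid,
                forms=None, senses=None):
    """Build a dict following the Wikidata Lexeme JSON data model.

    Hypothetical helper for illustration; a real pipeline would also
    attach forms (with grammatical features) and senses (with glosses).
    """
    return {
        "type": "lexeme",
        # Lemmas are keyed by language code, e.g. "ary" for Moroccan Arabic.
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "lexicalCategory": lexical_category_qid,  # e.g. Q1084 = noun
        "language": language_qid,  # Q-item for the language (placeholder below)
        "forms": forms or [],
        "senses": senses or [],
    }

# Hypothetical Darija entry; the language Q-id is a placeholder to be
# looked up on Wikidata, not a real identifier.
entry = make_lexeme("طوموبيل", "ary", "Q1084", "Q-MOROCCAN-ARABIC")
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

A record in this shape can be validated locally and then submitted through Wikibase's editing API, which is what makes the dataset directly reusable by the Wikidata community.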