The paper introduces a BETA-labeling framework for constructing a Bangla information retrieval (IR) dataset using multiple large language models (LLMs) as annotators, incorporating contextual alignment, consistency checks, and majority agreement to improve label quality. It investigates the feasibility of reusing IR datasets from other low-resource languages via machine translation, revealing substantial variation in meaning preservation and task validity across different language pairs. The study demonstrates the potential and limitations of LLM-assisted dataset creation for low-resource IR, highlighting the risks of cross-lingual dataset reuse.
LLM-based translation of low-resource IR datasets preserves meaning and task validity unevenly across language pairs, even when labels pass consistency checks and human evaluation, raising serious questions about cross-lingual dataset reuse.
Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed with a BETA-labeling framework that uses multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we assess meaning preservation and task validity between the source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and the limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
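The majority-agreement step described above can be illustrated with a minimal sketch. The abstract does not specify how votes are combined, so everything here is an assumption made for illustration: the function names `majority_label` and `aggregate`, the integer relevance labels, the strict-majority threshold, and the choice to route disputed pairs to human review. It is not the authors' implementation.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(
    votes: Sequence[Hashable], min_agreement: float = 0.5
) -> Optional[Hashable]:
    """Return the label backed by more than `min_agreement` of the votes, else None.

    Hypothetical helper: the 0.5 threshold (strict majority) is an assumption.
    """
    if not votes:
        return None
    label, top_count = Counter(votes).most_common(1)[0]
    # Accept only when the winning label's share strictly exceeds the threshold.
    if top_count / len(votes) > min_agreement:
        return label
    return None


def aggregate(
    annotations: dict[str, list[int]],
) -> tuple[dict[str, int], list[str]]:
    """Split query-document pairs into accepted labels and disputed pairs.

    `annotations` maps a pair id to one relevance label per LLM annotator.
    Disputed pairs have no majority and would need human review (assumed).
    """
    accepted: dict[str, int] = {}
    disputed: list[str] = []
    for pair_id, votes in annotations.items():
        label = majority_label(votes)
        if label is None:
            disputed.append(pair_id)
        else:
            accepted[pair_id] = label
    return accepted, disputed


# Illustrative votes from three hypothetical annotators per pair.
example = {
    "q1-d1": [1, 1, 0],  # two of three agree on 1
    "q1-d2": [0, 1, 2],  # no majority
    "q2-d7": [0, 0, 0],  # unanimous
}
accepted, disputed = aggregate(example)
```

With these example votes, `accepted` holds labels for `q1-d1` and `q2-d7`, while `q1-d2` lands in `disputed`. Keeping disputed pairs separate, rather than breaking ties arbitrarily, is one way to keep low-agreement items out of the final benchmark until a human has checked them.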