Mar 16, 2026 | arXiv:2603.15523

SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

AI Summary

The paper introduces SlovKE, a large-scale dataset for Slovak keyphrase extraction comprising 227,432 scientific abstracts with author-assigned keyphrases. It benchmarks three unsupervised methods (YAKE, TextRank, KeyBERT) and KeyLLM, a GPT-3.5-turbo-based approach, on this dataset. The unsupervised methods reach at most 11.6% exact-match F1@6 but up to 51.5% under partial matching, a gap attributed to inflected surface forms that do not match the author-assigned keyphrases. KeyLLM narrows this gap, and manual evaluation on 100 documents indicates that it captures relevant concepts that exact-match scoring underestimates.

Key Contribution

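In a morphologically rich language like Slovak, an LLM-based extractor (KeyLLM) produces keyphrases closer to the canonical forms assigned by authors, whereas statistical methods fail mainly because the inflected surface forms they extract do not match those canonical forms.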

Abstract

Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases, scraped and systematically cleaned from the Slovak Central Register of Theses, representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact-partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents (κ = 0.61) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods, a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
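
To make the exact-versus-partial gap concrete, the sketch below computes F1@k under both criteria for a toy Slovak example. It is an illustration only, not the paper's evaluation code: the function name `f1_at_k`, the toy keyphrases, and in particular the partial-match rule (two phrases match if any pair of their tokens shares a 4-character prefix) are assumptions, since this summary does not specify how the paper defines partial matching.

```python
import unicodedata


def _norm(phrase):
    # Lowercase, trim, and NFC-normalise so diacritics compare consistently.
    return unicodedata.normalize("NFC", phrase).lower().strip()


def _match(pred, gold, partial, prefix_len):
    if not partial:
        return pred == gold
    # ASSUMED partial criterion (hypothetical, not taken from the paper):
    # some token pair shares a prefix of at least `prefix_len` characters.
    return any(
        len(p) >= prefix_len and p[:prefix_len] == g[:prefix_len]
        for p in pred.split()
        for g in gold.split()
    )


def f1_at_k(predicted, gold, k=6, partial=False, prefix_len=4):
    """F1 of the top-k predicted keyphrases against the gold keyphrases."""
    top_k = [_norm(p) for p in predicted[:k]]
    gold_n = [_norm(g) for g in gold]
    if not top_k or not gold_n:
        return 0.0
    # Precision counts matched predictions; recall counts matched gold phrases.
    hit_pred = sum(any(_match(p, g, partial, prefix_len) for g in gold_n) for p in top_k)
    hit_gold = sum(any(_match(p, g, partial, prefix_len) for p in top_k) for g in gold_n)
    precision = hit_pred / len(top_k)
    recall = hit_gold / len(gold_n)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: gold keyphrases in the nominative, predictions partly in the genitive.
gold = ["strojové učenie", "neurónové siete", "klasifikácia"]
predicted = ["strojového učenia", "neurónových sietí", "klasifikácia", "dáta"]

exact = f1_at_k(predicted, gold, k=6)
partial = f1_at_k(predicted, gold, k=6, partial=True)
print(round(exact, 3), round(partial, 3))  # 0.286 0.857
```

In this toy case two of the four predictions name the right concept in an inflected form, so exact matching credits only one phrase while the prefix-based rule credits three. This mirrors, in miniature, the kind of gap the abstract reports for the unsupervised baselines.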

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
