Jun 11, 2026, arXiv:2606.13647

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

AI Summary

The paper introduces SkMTEB, the first comprehensive benchmark for Slovak text embeddings, comprising 31 datasets across seven task types, which substantially expands evaluation coverage for this low-resource language. An evaluation of 31 embedding models finds that large instruction-tuned multilingual models perform best, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To support local deployment, the authors develop two compact models, e5-sk-small and e5-sk-large, which remain competitive with proprietary APIs despite their reduced size.

Key Contribution

Large instruction-tuned multilingual models outperform existing Slovak-specific models on embedding tasks, and the new compact e5-sk models offer a locally deployable alternative whose performance is competitive with proprietary APIs.

Abstract

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

Eval Frameworks & Benchmarks, Natural Language Processing, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
