The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, covering 31 datasets across 7 task types, nearly 4$\times$ the Slovak coverage of existing multilingual benchmarks. An evaluation of 31 embedding models finds that large instruction-tuned multilingual models perform best, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To support local deployment, the authors build two compact models, \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M), by applying vocabulary trimming and fine-tuning to Multilingual E5 models. With size reductions of up to 62\%, these open-source models remain competitive with proprietary APIs.
Large instruction-tuned multilingual models outperform Slovak-specific models on embedding tasks, while the new compact \texttt{e5-sk} models offer a locally deployable alternative with performance competitive with proprietary APIs.
We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
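The abstract names vocabulary trimming but does not spell out the procedure. As a minimal sketch of the general idea, assuming trimming means keeping only the embedding rows for tokens that occur in a target-language corpus (plus special tokens) and renumbering the remaining ids, it could look like the following. The function name `trim_vocabulary`, the toy vocabulary, and the two-dimensional embeddings are hypothetical and for illustration only; this is not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    corpus_tokens: Set[str],
    special_tokens: Set[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep special tokens and tokens seen in the corpus; drop everything else.

    Hypothetical helper: illustrates the general technique only.
    """
    # Walk the vocabulary in original id order so relative order is preserved.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [
        token
        for token, _ in ordered
        if token in special_tokens or token in corpus_tokens
    ]
    # Assign new contiguous ids and copy over the matching embedding rows.
    new_vocab = {token: new_id for new_id, token in enumerate(kept)}
    new_embeddings = [embeddings[vocab[token]] for token in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary (assumed data, not from the paper).
vocab = {"<pad>": 0, "<unk>": 1, "dobry": 2, "hello": 3, "den": 4, "bonjour": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Tokens observed when tokenizing a (hypothetical) Slovak corpus.
corpus_tokens = {"dobry", "den"}

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus_tokens, special_tokens={"<pad>", "<unk>"}
)

# Fraction of embedding rows removed by trimming.
reduction = 1 - len(new_embeddings) / len(embeddings)
```

In a multilingual encoder the token embedding matrix accounts for a large share of the parameters, which is why dropping unused rows can shrink the model substantially while leaving the transformer layers untouched. A real pipeline would also need to update the tokenizer so that it emits the new ids.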