The Árni Magnússon Institute for Icelandic, University of Iceland, University of Zürich

Mar 17, 2026

arXiv:2603.16406

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson, Steinunn Rut Friðriksdóttir, Bjarki Ármannsson, Iris Edda Nowenstein, Steinþór Steingrímsson

AI Summary

This paper investigates the validity of LLM benchmarks for Icelandic, revealing significant flaws in benchmarks relying on synthetic or machine-translated data. Through quantitative error analysis, the authors demonstrate substantial differences in quality between human-authored/translated benchmarks and those using synthetic or machine-translated data. The findings highlight the risk of skewed results and undermined validity when using unverified synthetic or machine-translated data in low/medium-resource language benchmarking.

Key Contribution

LLM benchmarks for low/medium-resource languages that rely on unverified synthetic or machine-translated data commonly contain severely flawed test examples, which are likely to skew results and undermine the benchmarks' validity.

Abstract

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks vs. synthetic or machine-translated benchmarks.

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing

