Idiap, Le Mans University | Apr 23, 2026 | arXiv:2604.21928

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Gastón Burdisso, Petr Motlícek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

AI Summary

This paper explores the use of decoder-based Large Language Models (LLMs) for evaluating Automatic Speech Recognition (ASR) quality, moving beyond traditional Word Error Rate (WER). The authors evaluate LLMs in three ways: hypothesis selection, semantic distance computation using generative embeddings, and qualitative error classification. On the HATS dataset, the best LLMs reach 92-94% agreement with human annotators in hypothesis selection, compared with 63% for WER, and also outperform other semantic metrics, which points to their potential for more human-aligned ASR evaluation.

Key Contribution

LLMs can judge which of two ASR hypotheses is better in close agreement with human annotators (92-94% on HATS), well ahead of traditional metrics such as Word Error Rate (63%).

Abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
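
To make the hypothesis-selection comparison concrete, the following is a minimal sketch of how a distance metric such as WER could be scored for agreement with human choices. It is not the authors' implementation: the function names (`wer`, `agreement`), the tie-handling rule, and the two toy triplets are assumptions made for illustration, and the examples are invented rather than taken from HATS.

```python
from typing import Callable, List, Sequence, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def agreement(
    triplets: Sequence[Tuple[str, str, str]],
    human_choices: Sequence[str],
    metric: Callable[[str, str], float],
) -> float:
    """Fraction of (reference, hyp_a, hyp_b) triplets where the metric prefers
    the same hypothesis ("A" or "B") as the human annotators.
    Assumption: a tie between the two distances counts as a disagreement."""
    agree = 0
    for (ref, hyp_a, hyp_b), human in zip(triplets, human_choices):
        dist_a, dist_b = metric(ref, hyp_a), metric(ref, hyp_b)
        if dist_a == dist_b:
            continue  # tie: the metric cannot choose
        choice = "A" if dist_a < dist_b else "B"
        agree += choice == human
    return agree / len(triplets)


# Toy data, invented for illustration (not from HATS).
toy_triplets = [
    # Hypothesis A drops "not" and flips the meaning, yet has the lower WER.
    ("i do not want to go", "i do want to go", "i don't wanna go"),
    ("turn left at the station", "turn left at the station please",
     "turn right at the nation"),
]
toy_human_choices = ["B", "A"]

wer_agreement = agreement(toy_triplets, toy_human_choices, wer)
print(f"WER agreement with human choices: {wer_agreement:.2f}")
```

Because `agreement` accepts any callable that returns a distance, the same loop could in principle score an embedding-based distance, or an LLM judge wrapped to return a lower value for its preferred hypothesis. The first toy triplet shows the kind of case the abstract alludes to: a single deleted word reverses the meaning while keeping WER low.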

Eval Frameworks & Benchmarks | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
