RWTH | Mar 2, 2026 | arXiv:2603.02285

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Zijian Yang, Jörg Barkoczi, Ralf Schlüter, Hermann Ney

AI Summary

This paper introduces a theoretical framework based on classification error bounds to analyze the feasibility of unsupervised speech recognition (USR) with unpaired data. It identifies two conditions under which USR is possible and discusses their necessity. The authors derive a classification error bound under these conditions and propose a sequence-level cross-entropy loss motivated by the bound, validating it through simulations.

Key Contribution

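The paper identifies two conditions under which unsupervised speech recognition with unpaired data is possible, derives a classification error bound under those conditions, and uses the bound to motivate a single-stage sequence-level cross-entropy loss.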

Abstract

Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.

Speech & Audio | Training Efficiency & Optimization

Year: 2026

Venue: N/A
