This paper introduces a theoretical framework based on classification error bounds to analyze the feasibility of unsupervised speech recognition (USR) with unpaired data. It identifies two conditions under which USR is possible and discusses their necessity. The authors derive a classification error bound under these conditions and propose a sequence-level cross-entropy loss motivated by the bound, validating it through simulations.
Unsupervised speech recognition is possible under specific theoretical conditions, paving the way for training models without paired data.
Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
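The abstract does not define the proposed loss, so the following is only a toy sketch of what a sequence-level cross-entropy over unpaired data could look like, not the authors' formulation. It assumes discrete speech units, a per-unit classifier given as a probability table, and fixed-length sequences. The function names (`empirical_distribution`, `induced_sequence_distribution`, `sequence_cross_entropy`) and the toy data are hypothetical. The loss compares the empirical distribution of unpaired text sequences with the distribution over label sequences that the classifier induces from unpaired speech sequences.

```python
import math
from collections import Counter
from itertools import product


def empirical_distribution(sequences):
    """Empirical distribution over whole sequences (tuples of labels)."""
    counts = Counter(tuple(s) for s in sequences)
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}


def induced_sequence_distribution(speech_sequences, classifier, num_labels):
    """Distribution over label sequences induced by a per-unit classifier.

    classifier[x][y] is the assumed probability of label y given speech unit x.
    Each speech sequence contributes the product of its per-position label
    probabilities; the result is averaged over all speech sequences.
    """
    length = len(speech_sequences[0])
    dist = {}
    for labels in product(range(num_labels), repeat=length):
        total = 0.0
        for units in speech_sequences:
            prob = 1.0
            for x, y in zip(units, labels):
                prob *= classifier[x][y]
            total += prob
        dist[labels] = total / len(speech_sequences)
    return dist


def sequence_cross_entropy(text_dist, model_dist, eps=1e-12):
    """Cross-entropy H(text, model) computed over whole sequences."""
    return -sum(
        p * math.log(model_dist.get(seq, 0.0) + eps)
        for seq, p in text_dist.items()
    )


# Toy unpaired data (assumed): the speech and text sets are not aligned.
speech = [(0, 1), (0, 1), (1, 1)]
text = [(1, 1), (0, 1), (0, 1)]

identity = [[1.0, 0.0], [0.0, 1.0]]  # maps unit 0 -> label 0, unit 1 -> label 1
swapped = [[0.0, 1.0], [1.0, 0.0]]   # maps unit 0 -> label 1, unit 1 -> label 0

text_dist = empirical_distribution(text)
good_loss = sequence_cross_entropy(
    text_dist, induced_sequence_distribution(speech, identity, 2)
)
bad_loss = sequence_cross_entropy(
    text_dist, induced_sequence_distribution(speech, swapped, 2)
)
print(good_loss < bad_loss)
```

In this toy setting the classifier whose induced sequence distribution matches the text distribution attains the lower loss, even though no speech sequence is paired with a transcript. Enumerating all label sequences is exponential in sequence length and is used here only to keep the sketch short.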