This paper introduces an abstention-aware ASR framework that allows models to abstain from transcribing uncertain segments, improving reliability. The authors propose a new metric, RAS, which evaluates ASR reliability by balancing informativeness against error aversion, with the trade-off calibrated to human preferences. Experiments show that training with supervised bootstrapping followed by reinforcement learning substantially improves transcription reliability without sacrificing accuracy.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
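The paper does not give the formula for RAS here, but the description, a score that rewards informative output while penalizing confident errors more heavily than abstentions, suggests a shape like the following minimal sketch. Everything in it is an assumption for illustration: the function name, the counting of words into correct/erroneous/abstained buckets, and the exact functional form are hypothetical, with `lam` standing in for the trade-off parameter the authors calibrate from human preferences.

```python
# Hypothetical sketch of a reliability score in the spirit of RAS.
# Correctly transcribed words earn credit (informativeness); erroneous
# words incur a lambda-weighted penalty (error aversion); abstained
# words earn nothing but incur no penalty. This is NOT the paper's
# definition, only an illustration of the stated trade-off.

def ras_like_score(num_correct: int, num_errors: int,
                   num_abstained: int, lam: float = 1.0) -> float:
    """Toy reliability score over a reference of N words.

    num_correct   -- reference words transcribed correctly
    num_errors    -- reference words transcribed incorrectly
    num_abstained -- reference words the model declined to transcribe
    lam           -- trade-off between informativeness and error aversion
    """
    n = num_correct + num_errors + num_abstained
    if n == 0:
        return 0.0
    informativeness = num_correct / n   # credit for useful output
    error_rate = num_errors / n         # confident mistakes
    # Abstaining beats guessing whenever the lambda-weighted expected
    # error cost exceeds the expected informativeness gain.
    return informativeness - lam * error_rate


if __name__ == "__main__":
    # With lam = 2, abstaining on two uncertain words scores higher
    # than guessing them and getting both wrong.
    print(ras_like_score(num_correct=8, num_errors=2, num_abstained=0, lam=2.0))  # 0.4
    print(ras_like_score(num_correct=8, num_errors=0, num_abstained=2, lam=2.0))  # 0.8
```

Under a score of this shape, a larger `lam` makes the model more conservative, which is why the paper calibrates the trade-off against human preferences rather than fixing it arbitrarily.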