This paper introduces Speech Translation Error Labelling (STEL), together with an annotation protocol, as a step toward evaluating confidence and quality estimation for speech translations. The authors create a small, authentic end-to-end evaluation dataset and analyse how existing text-only and speech-processing systems perform the STEL task. The systems reach roughly half the precision of human annotators, and the two system types are complementary in labelling translation-only versus speech-processing errors. The authors also find that direct speech processing is necessary for the task.
Text-only XCOMET and the multimodal LLM Qwen2.5-Omni reach only about half the precision of humans when labelling speech translation errors, in an area that still lacks an established evaluation methodology.
Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol, a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task with roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.