Finally, a streaming ASR model matches Whisper's offline transcription quality while maintaining sub-second latency.
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between the audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large multilingual corpus spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
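The abstract names Ada RMS-Norm as the mechanism for conditioning the model on its transcription delay. The paper's exact formulation is not given here, but the general pattern of adaptive RMS normalization (as in AdaLN-style conditioning) can be sketched as follows: normalize activations by their root-mean-square, then apply a per-channel gain predicted from a conditioning embedding (here, of the delay). The function name, the single linear projection, and the shapes are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ada_rms_norm(x, delay_embedding, w_gain, b_gain, eps=1e-6):
    """Illustrative Ada RMS-Norm (assumed form, not the paper's code).

    x:               activations, shape (..., d_model)
    delay_embedding: conditioning vector encoding the target delay, shape (d_cond,)
    w_gain, b_gain:  linear projection producing a per-channel gain,
                     shapes (d_cond, d_model) and (d_model,)
    """
    # Standard RMS normalization over the channel dimension.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    x_norm = x / rms
    # Gain predicted from the delay embedding (assumption: one linear layer,
    # in the style of adaptive LayerNorm conditioning).
    gain = delay_embedding @ w_gain + b_gain
    return x_norm * gain
```

With a zero conditioning vector and unit bias, this reduces to plain RMS-Norm; a learned delay embedding lets the same weights serve different latency targets.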