Finally, a streaming ASR model matches Whisper's offline transcription quality while maintaining sub-second latency.
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between the audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large multilingual corpus spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
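The abstract names Ada RMS-Norm as the mechanism for conditioning the model on its transcription delay. The paper's exact formulation is not given here, but the general pattern of adaptive RMS normalization (as in AdaLN-style conditioning) can be sketched as follows: normalize activations by their root-mean-square, then apply a per-channel gain predicted from a conditioning embedding (here, of the delay). The function name, the single linear projection, and the shapes are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ada_rms_norm(x, delay_embedding, w_gain, b_gain, eps=1e-6):
    """Illustrative Ada RMS-Norm (assumed form, not the paper's code).

    x:               activations, shape (..., d_model)
    delay_embedding: conditioning vector encoding the target delay, shape (d_cond,)
    w_gain, b_gain:  linear projection producing a per-channel gain,
                     shapes (d_cond, d_model) and (d_model,)
    """
    # Standard RMS normalization over the channel dimension.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    x_norm = x / rms
    # Gain predicted from the delay embedding (assumption: one linear layer,
    # in the style of adaptive LayerNorm conditioning).
    gain = delay_embedding @ w_gain + b_gain
    return x_norm * gain
```

With a zero conditioning vector and unit bias, this reduces to plain RMS-Norm; a learned delay embedding lets the same weights serve different latency targets.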