Mar 30, 2026 · arXiv:2603.28723

Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes

Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie

AI Summary

This paper presents a Bi-LSTM model for acoustic-to-articulatory inversion of the complete vocal tract, trained on real-time MRI (RT-MRI) data. The approach uses articulator contours automatically extracted from the RT-MRI images as the geometric representation of the vocal tract, paired with denoised audio. Experiments evaluate how the audio embedding (MFCCs, LCCs, or HuBERT) and the dataset size (10 minutes to 3.5 hours) affect inversion accuracy. The average RMSE is 1.48 mm, below the 1.62 mm pixel size.

Key Contribution

Inverts speech acoustics to the complete vocal tract, from the glottis to the lips, using articulator contours extracted from RT-MRI rather than EMA, which is limited by the number of sensors and restricted to accessible articulators.

Abstract

Acoustic-to-articulatory inversion strongly depends on the type of data used. While most previous studies rely on EMA, which is limited by the number of sensors and restricted to accessible articulators, we propose an approach aiming at a complete inversion of the vocal tract, from the glottis to the lips. To this end, we used approximately 3.5 hours of RT-MRI data from a single speaker. The innovation of our approach lies in the use of articulator contours automatically extracted from MRI images, rather than the raw images themselves. By focusing on these contours, the model prioritizes the essential geometric dynamics of the vocal tract while discarding redundant pixel-level information. These contours, alongside denoised audio, were then processed using a Bi-LSTM architecture. Two experiments were conducted: (1) an analysis of the impact of the audio embedding, for which three types of embeddings were evaluated as input to the model (MFCCs, LCCs, and HuBERT), and (2) a study of the influence of the dataset size, which we varied from 10 minutes to 3.5 hours. Evaluation was performed on the test data using RMSE, median error, and Tract Variables, to which we added one measurement: the larynx height. The average RMSE obtained is 1.48 mm, which is below the pixel size of 1.62 mm. These results confirm the feasibility of a complete vocal-tract inversion using RT-MRI data.
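The abstract reports RMSE and median error in millimetres and compares them with the 1.62 mm pixel size, but does not spell out how the errors are computed. The sketch below shows one plausible reading, assuming the error is the point-to-point Euclidean distance between predicted and reference contour points given in pixel coordinates. The function names and the toy contours are hypothetical and are not taken from the paper.

```python
import math
import statistics

PIXEL_SIZE_MM = 1.62  # pixel size quoted in the abstract


def point_errors_mm(predicted, reference, pixel_size_mm=PIXEL_SIZE_MM):
    """Euclidean distance in mm between matching (x, y) points given in pixels.

    Assumption: predicted and reference contours are sampled at the same
    number of points, so that point i of one corresponds to point i of the other.
    """
    if len(predicted) != len(reference):
        raise ValueError("contours must have the same number of points")
    return [
        math.hypot(px - rx, py - ry) * pixel_size_mm
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]


def rmse_mm(errors):
    """Root mean square of the per-point errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def median_error_mm(errors):
    """Median of the per-point errors, less sensitive to outliers than RMSE."""
    return statistics.median(errors)


# Toy contours in pixel coordinates (illustrative values only).
reference = [(10.0, 20.0), (11.0, 21.0), (12.0, 22.0), (13.0, 23.0)]
predicted = [(10.0, 21.0), (11.0, 21.0), (12.0, 22.0), (16.0, 27.0)]

errors = point_errors_mm(predicted, reference)
print(f"RMSE: {rmse_mm(errors):.2f} mm")
print(f"Median error: {median_error_mm(errors):.2f} mm")
print(f"RMSE below pixel size: {rmse_mm(errors) < PIXEL_SIZE_MM}")
```

In this toy case a single badly placed point pushes the RMSE well above the median, which is one reason to report both figures side by side.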

Computer Vision, Multimodal Models, Speech & Audio

Year: 2026

Venue: N/A
