The paper introduces SLD-L2S, a novel lip-to-speech (L2S) framework based on a hierarchical subspace latent diffusion model that directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, bypassing intermediate representations. The method employs a hierarchical architecture with parallel subspaces and a diffusion convolution block (DiCB) to enhance interactions within and between subspaces. By using reparameterized flow matching, the framework incorporates speech language model (SLM) and semantic losses during training, leading to state-of-the-art generation quality on benchmark datasets.
Ditch mel-spectrograms: a hierarchical subspace latent diffusion model maps lip movements directly into a neural audio codec's latent space, achieving state-of-the-art lip-to-speech synthesis.
Although lip-to-speech (L2S) synthesis has made significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens, and the potential of latent diffusion models (LDMs) for this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method directly maps visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. At its core is a hierarchical architecture that processes visual representations through multiple parallel subspaces, beginning with a subspace decomposition module. To efficiently model interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique that generates the target latent vectors directly, which enables the principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
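To make the two central ideas in the abstract concrete, here is a minimal PyTorch sketch of (1) decomposing visual features into parallel subspaces and (2) "reparameterized" flow matching that predicts the clean codec latent directly, so that auxiliary SLM/semantic losses can be applied to that prediction. This is not the paper's code: all module names, shapes, the number of subspaces, and the loss weighting are illustrative assumptions.

```python
# Sketch only: module names, dimensions, and loss weights are assumptions,
# not taken from the SLD-L2S implementation.

import torch
import torch.nn as nn

class SubspaceDecomposition(nn.Module):
    """Split a visual feature sequence into K parallel subspaces."""
    def __init__(self, dim: int, num_subspaces: int):
        super().__init__()
        assert dim % num_subspaces == 0
        self.k = num_subspaces
        self.proj = nn.Linear(dim, dim)  # learned mixing before the split

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (batch, frames, dim) -> K tensors of shape (batch, frames, dim // K)
        return list(self.proj(x).chunk(self.k, dim=-1))

def flow_matching_step(model, x1, cond, slm_loss_fn=None, lambda_slm=0.1):
    """One training step of x1-prediction ("reparameterized") flow matching.

    x1:   clean codec latents, (batch, frames, latent_dim)
    cond: visual conditioning features from the lip encoder
    """
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation path

    x1_hat = model(xt, t, cond)                      # network predicts x1 directly
    v_hat = (x1_hat - xt) / (1 - t).clamp_min(1e-4)  # velocity implied by x1_hat
    v_target = x1 - x0                               # ground-truth velocity

    loss = ((v_hat - v_target) ** 2).mean()
    if slm_loss_fn is not None:
        # Because x1_hat is an explicit clean-latent estimate at every step,
        # perceptual losses (SLM / semantic) can be applied to it directly.
        loss = loss + lambda_slm * slm_loss_fn(x1_hat, x1)
    return loss
```

The algebra behind the reparameterization: with the linear path x_t = (1 - t) x0 + t x1, the target velocity is x1 - x0, and (x1_hat - x_t) / (1 - t) recovers exactly that velocity when x1_hat = x1. Predicting x1 rather than the velocity is what lets losses defined on clean latents, such as SLM and semantic objectives, enter the training loop in a principled way.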