The paper addresses the challenge of speech enhancement in real-world scenarios with compound degradations by proposing a novel conditioning method for diffusion-based models. Instead of injecting degradation information only at the input layer, the proposed SLICE method injects conditioning embeddings, derived from a pretrained encoder with multi-task heads, into the timestep embedding of the diffusion model. This layer-wise injection strategy allows the conditioning information to propagate through all residual blocks, leading to improved performance on compound degradations and better generalization to real-world recordings.
Injecting degradation conditioning into the timestep embedding, so that it reaches every residual block, improves diffusion-based speech enhancement under compound real-world degradations, where input-only injection can perform worse than an unconditioned model.
Real-world speech is often corrupted by multiple degradations simultaneously, including additive noise, reverberation, and nonlinear distortion. Diffusion-based enhancement methods perform well on single degradations but struggle with compound corruptions. Prior noise-aware approaches inject conditioning at the input layer only, which can degrade performance below that of an unconditioned model. To address this, we propose injecting degradation conditioning, derived from a pretrained encoder with multi-task heads for noise type, reverberation, and distortion, into the timestep embedding so that it propagates through all residual blocks without architectural changes. In controlled experiments where only the injection method varies, input-level conditioning performs worse than no encoder at all on compound degradations, while layer-wise injection achieves the best results. The method also generalizes to diverse real-world recordings.
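The injection idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The class names (`ToyDenoiser`, `ResidualBlock`), the dimensions, the additive fusion of the conditioning with the timestep embedding, and the 3-dimensional `cond` vector standing in for the pretrained encoder's output are all assumptions made for illustration. The point it shows is structural: because every residual block already consumes the timestep embedding, adding the conditioning there lets it reach every block without changing the blocks themselves.

```python
import math
import random


def sinusoidal_embedding(t, dim):
    """Standard sinusoidal timestep embedding (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(x, weight, bias):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ResidualBlock:
    """Toy residual block: h + tanh(W h + shift), where shift comes from the embedding."""

    def __init__(self, rng, feat_dim, emb_dim):
        self.w, self.b = init_linear(rng, feat_dim, feat_dim)
        self.emb_w, self.emb_b = init_linear(rng, feat_dim, emb_dim)

    def __call__(self, h, emb):
        shift = linear(emb, self.emb_w, self.emb_b)  # per-block use of the embedding
        pre = linear(h, self.w, self.b)
        return [hi + math.tanh(p + s) for hi, p, s in zip(h, pre, shift)]


class ToyDenoiser:
    """Hypothetical stand-in for a diffusion backbone made of residual blocks."""

    def __init__(self, feat_dim=4, emb_dim=8, cond_dim=3, n_blocks=3, seed=0):
        rng = random.Random(seed)
        self.emb_dim = emb_dim
        # Projection from the (assumed) degradation embedding to the timestep-embedding size.
        self.cond_w, self.cond_b = init_linear(rng, emb_dim, cond_dim)
        self.blocks = [ResidualBlock(rng, feat_dim, emb_dim) for _ in range(n_blocks)]

    def forward(self, x, t, cond=None):
        emb = sinusoidal_embedding(t, self.emb_dim)
        if cond is not None:
            # Assumed fusion: add the projected conditioning to the timestep embedding.
            projected = linear(cond, self.cond_w, self.cond_b)
            emb = [e + c for e, c in zip(emb, projected)]
        h = list(x)
        for block in self.blocks:
            h = block(h, emb)  # the same fused embedding reaches every block
        return h


model = ToyDenoiser()
x = [0.5, -0.2, 0.1, 0.0]
out_unconditioned = model.forward(x, t=10)
out_conditioned = model.forward(x, t=10, cond=[1.0, 0.0, 0.5])
```

In this sketch the residual blocks are identical in the conditioned and unconditioned cases; only the embedding they receive differs, which mirrors the "without architectural changes" property claimed in the abstract. An input-level variant would instead combine `cond` with `x` before the first block, so later blocks would see the conditioning only indirectly.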