This paper introduces BareWave, a framework for direct text-to-wave generation in flow-matching text-to-speech (TTS) that removes the intermediate acoustic representation. The authors identify three training challenges in raw-waveform modeling: the lack of a strong pretrained representational scaffold, the need for different noise schedules at different stages of training, and the mismatch between data-space perceptual objectives and the velocity-space flow objective. They address these with training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA). Experiments on zero-shot voice cloning report strong intelligibility, speaker similarity, and naturalness under a fully waveform-native inference path.
BareWave shows that direct waveform generation, with no intermediate acoustic representation, can reach strong intelligibility, speaker similarity, and naturalness in text-to-speech.
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We argue that this setting raises three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to combine with effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. A project page with audio demos is available at https://barewave.github.io/.
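To make the velocity-space flow objective and the idea of staged noise scheduling concrete, the following is a minimal sketch on a toy waveform. It is not BareWave's implementation: the abstract does not specify the probability path or the schedules, so the linear interpolation path, the two stage names (`"early"`, `"late"`), the uniform and logit-normal time samplers, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def sample_t(stage, rng):
    """Draw a flow time t. Both schedules here are illustrative assumptions."""
    if stage == "early":
        return rng.random()  # uniform over [0, 1)
    # "late": logit-normal sample, which concentrates t around 0.5
    return 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))


def flow_matching_pair(x1, t, rng):
    """Return the noisy input x_t and target velocity for an assumed linear path.

    x_t = (1 - t) * x0 + t * x1, so the target velocity d(x_t)/dt is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    xt = [(1.0 - t) * n + t * s for n, s in zip(x0, x1)]
    v = [s - n for n, s in zip(x0, x1)]
    return xt, v


def velocity_mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
# Toy "waveform": 64 samples of a sine, standing in for raw audio.
wave = [math.sin(2 * math.pi * 5 * i / 64) for i in range(64)]
t = sample_t("late", rng)
xt, v = flow_matching_pair(wave, t, rng)
# Loss of a placeholder model that always predicts zero velocity.
loss = velocity_mse([0.0] * len(v), v)
```

In this sketch, switching `stage` changes only how `t` is drawn, which is one simple way a training recipe could vary its noise schedule between stages. A perceptual loss defined on waveforms would instead need a data-space estimate such as `xt + (1 - t) * v_pred`, which hints at why data-space and velocity-space objectives do not line up automatically.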