Search papers, labs, and topics across Lattice.
The paper introduces VoxEffects, a new speech audio effects dataset with precise effect-chain supervision at multiple granularities. It supports speech-oriented audio effect identification tasks, including effect presence detection, preset classification, and intensity prediction. Experiments using an AudioMAE-based multi-task baseline demonstrate the impact of domain shift, robustness, input duration, and gender fairness on performance.
Training AI to identify audio effects in speech is now possible, thanks to a new dataset that precisely labels how effects are applied.
Speech audio in the wild is often processed by post-production effects, but existing speech datasets rarely provide precise annotations of effects and parameters, limiting systematic study. We introduce VoxEffects, a speech audio effects dataset that pairs produced speech with exact effect-chain supervision at multiple granularities. VoxEffects supports speech-oriented audio effect identification: given a produced waveform, infer which effects are present and how they are applied. Built from minimally edited clean speech, it provides an extensible rendering pipeline for both offline synthesis and on-the-fly rendering for efficient training and evaluation. The audio effect identification benchmark includes effect presence detection, preset classification, and intensity prediction, with a robustness protocol covering capture-side and platform-side degradations. We provide an AudioMAE-based multi-task baseline and analyses of domain shift, robustness, input duration, and gender fairness.