This paper evaluates the latent space structure of three VAE architectures for musical timbre generation: unsupervised, descriptor-conditioned, and perceptual feature-conditioned. The authors assess latent space organization with clustering and interpretability metrics, including silhouette scores, timbre descriptor compactness, and pitch-conditional separation. The key finding is that conditioning on continuous perceptual features yields a more compact, discriminative, and pitch-invariant latent space than either the unsupervised or the descriptor-conditioned VAE.
Forget one-hot encodings: conditioning timbre VAEs on continuous perceptual features unlocks more compact and controllable latent spaces.
We present a comparative evaluation of latent space organization in three Variational Autoencoders (VAEs) for musical timbre generation: an unsupervised VAE, a descriptor-conditioned VAE, and a VAE conditioned on continuous perceptual features from the AudioCommons timbral models. Using a curated dataset of electric guitar sounds labeled with 19 semantic descriptors across four intensity levels, we assess each model's latent structure with a suite of clustering and interpretability metrics. These include silhouette scores, timbre descriptor compactness, pitch-conditional separation, trajectory linearity, and cross-pitch consistency. Our findings show that conditioning on perceptual features yields a more compact, discriminative, and pitch-invariant latent space, outperforming both the unsupervised and discrete descriptor-conditioned models. This work highlights the limitations of one-hot semantic conditioning and provides methodological tools for evaluating timbre latent spaces, contributing to the development of more controllable and interpretable generative audio models.
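To make one of the listed metrics concrete, here is a minimal sketch of the standard silhouette coefficient applied to latent codes grouped by descriptor label. The abstract does not say how the authors compute it, so the Euclidean distance, the function names, and the toy 2-D latent codes below are all illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    point's score is (b - a) / max(a, b).
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D latent codes for four guitar sounds (not data from the paper).
latents = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Labels that match the geometry: each descriptor forms a tight, distant group.
separated = ["bright", "bright", "warm", "warm"]
# Labels that cut across the geometry: each descriptor is spread over both groups.
entangled = ["bright", "warm", "bright", "warm"]

score_separated = silhouette_score(latents, separated)
score_entangled = silhouette_score(latents, entangled)
print(f"separated: {score_separated:.3f}, entangled: {score_entangled:.3f}")
```

Scores near 1 indicate that sounds sharing a descriptor sit close together and far from other descriptors, while scores near or below 0 indicate overlapping or mixed groups. In this reading, a "more compact, discriminative" latent space corresponds to a higher silhouette score when points are labeled by timbre descriptor.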