This paper investigates the interpretability of Neural Audio Codec (NAC) representations, focusing on accent information, by using Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse activations. The authors propose a framework to quantify NAC interpretability via a relative performance index, evaluated across four NAC models and 16 SAE configurations. Results indicate that DAC and SpeechTokenizer exhibit the highest interpretability, that acoustic-oriented NACs encode accent in activation magnitudes while phonetic-oriented NACs encode it in activation positions, and that low-bitrate EnCodec variants show higher interpretability.
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute, accent, and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in the activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.
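To make the decomposition concrete, here is a minimal sketch of a standard ReLU Sparse Autoencoder forward pass applied to a dense embedding. All dimensions, weights, and variable names are illustrative assumptions, not the paper's actual SAE configurations; it only shows the general idea of splitting a sparse code into activation positions versus activation magnitudes, the two views contrasted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a dense NAC frame embedding and an overcomplete SAE dictionary.
d_model = 8    # dense embedding size (illustrative)
d_sae = 32     # SAE dictionary size (illustrative)

# Randomly initialized weights stand in for a trained SAE.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a dense vector into a sparse non-negative code, then reconstruct."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU yields sparse activations
    x_hat = z @ W_dec + b_dec               # linear reconstruction
    return z, x_hat

x = rng.normal(size=d_model)  # stand-in for a dense NAC embedding
z, x_hat = sae_forward(x)

# "Activation positions" = which SAE units fire;
# "activation magnitudes" = how strongly they fire.
positions = np.flatnonzero(z)
magnitudes = z[positions]
```

In this framing, an accent probe reading only `positions` tests whether accent is carried by which features activate, while one reading `magnitudes` tests whether it is carried by how strongly they activate.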