The paper introduces Time-Domain Voice Identity Morphing (TD-VIM), a novel signal-level approach to generate morphed voice samples that can match multiple identities, posing a security risk to speaker verification systems. TD-VIM blends voice characteristics from two individuals directly at the signal level using morphing factors. Experiments on the Multilingual Audio-Visual Smartphone database demonstrate a high attack success rate against deep-learning-based and commercial speaker verification systems, achieving G-MAP values up to 99.74%.
Speaker verification systems are highly vulnerable to a new signal-level voice morphing attack, which reaches G-MAP values up to 99.74% at a false match rate of 0.1% against both deep-learning-based and commercial systems.
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method enables the blending of voice characteristics from two distinct identities at the signal level, creating morphed samples that present a high vulnerability for speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone database, our study created four distinct morphed signals based on morphing factors and evaluated their effectiveness using a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
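The idea of blending two voices at the signal level with a morphing factor can be illustrated with a minimal sketch. The abstract does not give the blending formula, so this assumes the simplest reading: a sample-wise convex combination of two mono waveforms that are already time-aligned and share a sampling rate. The function name `morph_time_domain`, the parameter `alpha`, and the factor values below are hypothetical and are not taken from the paper.

```python
import numpy as np


def morph_time_domain(x_a, x_b, alpha):
    """Blend two mono waveforms sample by sample.

    Assumption (not from the paper): the morph is the convex combination
    alpha * x_a + (1 - alpha) * x_b, with alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    x_a = np.asarray(x_a, dtype=np.float64)
    x_b = np.asarray(x_b, dtype=np.float64)

    # Assumption: truncate both signals to their common length.
    n = min(len(x_a), len(x_b))
    mixed = alpha * x_a[:n] + (1.0 - alpha) * x_b[:n]

    # Rescale only if the blend would clip outside [-1, 1].
    peak = np.max(np.abs(mixed)) if n else 0.0
    return mixed / peak if peak > 1.0 else mixed


# Synthetic stand-ins for two speakers' utterances (pure tones, 16 kHz).
sr = 16000
t = np.arange(sr) / sr
speaker_a = 0.8 * np.sin(2 * np.pi * 120.0 * t)
speaker_b = 0.8 * np.sin(2 * np.pi * 210.0 * t)

# Hypothetical morphing factors; the paper's actual values are not stated here.
morphs = {a: morph_time_domain(speaker_a, speaker_b, a) for a in (0.2, 0.4, 0.6, 0.8)}
```

With `alpha` close to 1 the morph is dominated by the first speaker, and with `alpha` close to 0 by the second; a real pipeline would load recorded utterances instead of synthetic tones and would need alignment of the spoken content, which this sketch does not attempt.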