Mar 8, 2026 | arXiv:2603.07708

VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

Sumit Ranjan, Sugandha Sharma, Ubaid Abbas, Puneeth N Ail

AI Summary

VoiceSHIELD-Small is a real-time model that transcribes speech and detects malicious speech at the same time, addressing security risks in voice interfaces. It extends OpenAI's Whisper-small encoder with a mean-pooling layer and a classification head, and classifies audio in 90-120 ms on mid-tier GPUs. Evaluated on a balanced set of 947 audio clips, VoiceSHIELD-Small achieved 99.16% accuracy and a 0.9865 F1 score, showing that it can identify harmful voice inputs with low latency.
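The pooling-plus-head design described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single-logit sigmoid head, the 0.5 threshold, and the names `mean_pool`, `malicious_probability`, and `classify` are assumptions made for illustration, and the toy 3-dimensional frames stand in for real encoder outputs (Whisper-small's encoder produces 768-dimensional frame vectors).

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the encoder's frame vectors over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def malicious_probability(
    pooled: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Hypothetical single-logit head: sigmoid(w . x + b)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def classify(
    frames: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # assumed default; the source does not state it
) -> Tuple[str, float]:
    """Pool the frames, score them, and apply a decision threshold."""
    prob = malicious_probability(mean_pool(frames), weights, bias)
    return ("malicious" if prob >= threshold else "safe"), prob


# Toy input: two time steps of 3-dimensional "encoder" features.
frames = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [0.5, -1.0, 0.25]  # illustrative values, not trained weights
bias = 0.0

label, prob = classify(frames, weights, bias)
print(label, round(prob, 4))
```

Because pooling collapses the time axis into one fixed-size vector, the head's cost does not grow with clip length, which is one plausible reason a design like this can classify quickly alongside transcription.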

Key Contribution

Achieves 99.16% accuracy in real-time malicious speech detection without slowing transcription, using a lightweight model built on the Whisper-small encoder.

Abstract

Voice interfaces are quickly becoming a common way for people to interact with AI systems. This also brings new security risks, such as prompt injection, social engineering, and harmful voice commands. Traditional security methods convert speech to text and then filter that text, which introduces delays and can miss important audio cues. This paper introduces VoiceSHIELD-Small, a lightweight model that works in real time. It transcribes speech and classifies it as safe or harmful in a single step. Built on OpenAI's Whisper-small encoder, VoiceSHIELD adds a mean-pooling layer and a simple classification head. It takes just 90-120 milliseconds to classify audio on mid-tier GPUs, while transcription happens at the same time. Tested on a balanced set of 947 audio clips, the model achieved 99.16 percent accuracy and an F1 score of 0.9865. At the default setting, it missed 2.33 percent of harmful inputs. Cross-validation showed consistent performance (F1 standard deviation = 0.0026). The paper also covers the model's design, training data, performance trade-offs, and responsible use guidelines. VoiceSHIELD is released under the MIT license to encourage further research and adoption in voice AI security.
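For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, F1, and the miss rate (the share of harmful inputs classified as safe) follow from confusion-matrix counts. The counts are made up for illustration and are not the paper's results; the function name `detection_metrics` is likewise hypothetical.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard binary-detection metrics, treating 'harmful' as positive."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Miss rate = false-negative rate = 1 - recall.
        "miss_rate": fn / (tp + fn),
    }


# Illustrative counts only (not taken from the paper):
# 100 harmful clips (90 caught, 10 missed), 100 safe clips (5 flagged in error).
metrics = detection_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The miss rate is worth reading separately from accuracy: in a security setting, a missed harmful input is usually costlier than a false alarm, so changing the decision setting trades one kind of error against the other.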

Natural Language Processing | Red-Teaming & Adversarial Robustness | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
