Feb 17, 2026 | arXiv:2602.15484

Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction

Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mondal, Prasanta Kumar Ghosh

AI Summary

This paper introduces a bottleneck transformer architecture for non-intrusive prediction of the Short-Time Objective Intelligibility (STOI) score, addressing a limitation of traditional STOI calculation: it requires clean reference speech. The proposed model uses convolution blocks for frame-level feature extraction and a multi-head self-attention (MHSA) layer for information aggregation, which lets the transformer focus on the most relevant parts of the input. Experimental results show that the bottleneck transformer achieves higher correlation and lower mean squared error than state-of-the-art models that rely on self-supervised learning and spectral features, on both seen and unseen data.
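The aggregation step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' model: it assumes the frame-level features are already available (the paper learns them with convolution blocks), uses a single attention head with identity projections instead of learned multi-head projections, and maps the pooled vector to a score with a hypothetical linear layer and sigmoid. The names `self_attention`, `mean_pool`, and `predict_score` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the frames themselves
    (identity projections); a real MHSA layer learns these projections
    and runs several heads in parallel.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out


def mean_pool(frames):
    """Average frame vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


def predict_score(frames, w, b):
    """Hypothetical output head: linear layer plus sigmoid, giving a value in (0, 1)."""
    pooled = mean_pool(self_attention(frames))
    z = sum(wi * xi for wi, xi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Toy input: 3 frames with 2 features each; weights are arbitrary, not trained.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
score = predict_score(frames, w=[0.5, -0.5], b=0.0)
```

The sigmoid is used here only because STOI values typically lie between 0 and 1; the source does not state how the model's output layer is built.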

Key Contribution

Ditch the clean reference speech: a bottleneck transformer predicts speech intelligibility better than SSL-based methods, even in noisy, unseen conditions.

Abstract

In this study, we present a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this limitation, numerous deep learning-based non-intrusive speech assessment models have attracted significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose the use of a bottleneck transformer that incorporates convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model shows higher correlation and lower mean squared error in both seen and unseen scenarios than the state-of-the-art model that uses self-supervised learning (SSL) and spectral features as inputs.
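The two evaluation criteria named in the abstract can be computed as in the sketch below. It assumes the correlation is a Pearson linear correlation, which the abstract does not specify, and the score lists are made-up values for illustration, not results from the paper.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson linear correlation coefficient (assumed correlation type)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Illustrative values only: predicted vs. reference STOI for four utterances.
predicted = [0.62, 0.81, 0.45, 0.90]
reference = [0.60, 0.85, 0.50, 0.88]
error = mse(predicted, reference)
corr = pearson(predicted, reference)
```

A better predictor drives `error` toward 0 and `corr` toward 1, which is the sense in which the abstract reports "higher correlation and lower mean squared error".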

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
