Jan 20, 2026

arXiv:2601.14227

Transformer Architectures for Respiratory Sound Analysis and Multimodal Diagnosis

Theodore Aptekarev, V. Sokolovsky, G. Furman

AI Summary

This paper adapts the Audio Spectrogram Transformer (AST) for respiratory sound analysis, fine-tuning it on a medical dataset and comparing its performance to a CNN baseline and a multimodal Vision-Language Model (VLM). The AST model achieves approximately 97% accuracy, outperforming the CNN baseline and external benchmarks for asthma detection. The VLM, integrating spectrograms with patient metadata, reaches 86-87% accuracy, demonstrating the potential of multimodal architectures for diagnosis.

Key Contribution

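A fine-tuned Audio Spectrogram Transformer detects asthma from respiratory sounds with approximately 97% accuracy, outperforming the CNN baseline. A separate multimodal VLM that combines spectrograms with patient metadata reaches 86-87% accuracy, comparable to the CNN baseline, and shows that clinical context can be brought into the inference step.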

Abstract

Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) for respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) to output a JSON-formatted diagnosis. Results indicate that AST achieves approximately 97% accuracy with an F1-score around 97% and ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing comparably to the CNN baseline while demonstrating the capability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
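The abstract describes the VLM input as a structured text prompt (sex, age, recording site) and its output as a JSON-formatted diagnosis. The Python sketch below illustrates one way such a prompt could be assembled and the reply validated. The prompt wording, the `diagnosis` key, the label set, and both function names are assumptions made for illustration; none of them is taken from the paper, and the model call itself is left out.

```python
import json

# Hypothetical label set; the paper's actual diagnosis classes are not listed here.
ALLOWED_DIAGNOSES = {"asthma", "healthy"}


def build_prompt(sex, age, recording_site):
    """Assemble a structured text prompt from patient metadata (assumed wording)."""
    return (
        f"Patient metadata: sex={sex}; age={age}; "
        f"recording_site={recording_site}. "
        "Based on the attached spectrogram, reply with JSON of the form "
        '{"diagnosis": "<label>"}.'
    )


def parse_diagnosis(raw_reply):
    """Return a normalized diagnosis label, or None if the reply is unusable."""
    # Models often wrap JSON in extra text, so take the outermost braces.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("diagnosis")
    if not isinstance(label, str):
        return None
    label = label.strip().lower()
    # Reject labels outside the expected set instead of guessing.
    return label if label in ALLOWED_DIAGNOSES else None
```

Returning `None` for malformed or out-of-vocabulary replies keeps unparseable outputs visible, so they can be counted as errors rather than silently mapped to a class when accuracy is computed.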

Architecture Design (Transformers, SSMs, MoE); Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
