Mar 16, 2026 · arXiv:2603.15440

Music Genre Classification: A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches

Sachin Prajuli, Abhishek Karna, OmPrakash Dhakl

AI Summary

To address the gap in non-Western music genre classification, the authors built a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres. They trained and compared nine classification models: five classical machine learning classifiers on hand-crafted audio features and four deep learning architectures on Mel spectrograms. The sequential Convolutional Recurrent Neural Network (CRNN) achieved the highest accuracy, 84%, outperforming the best classical models (71%) and all other deep learning architectures.

Key Contribution

A sequential CNN-RNN architecture (CRNN) outperforms classical ML and other deep learning approaches at classifying Nepali music genres, achieving 84% accuracy on a new dataset of approximately 8,000 audio clips.
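
To make "sequential" concrete: in a CRNN of this kind, convolution and pooling first shrink the spectrogram, and the time axis that remains becomes the sequence read by the recurrent layer. The sketch below traces shapes only, in plain Python. The 640 x 128 input size is the one reported in the abstract; the function name, the number of blocks, the 2 x 2 pooling, and the channel counts are illustrative assumptions, not the authors' configuration.

```python
def crnn_sequence_shape(time_frames, mel_bins, channels, pool=2):
    """Trace shapes through conv blocks, then flatten for a recurrent layer.

    Assumes every block is a 'same'-padded convolution (size unchanged)
    followed by pool x pool max pooling (floor division).
    Returns (sequence_length, features_per_step).
    """
    if not channels:
        raise ValueError("need at least one convolutional block")
    t, f = time_frames, mel_bins
    for _ in channels:
        # Pooling shrinks both the time axis and the frequency axis.
        t, f = t // pool, f // pool
    # Each remaining time step carries all channels x frequency bins.
    return t, channels[-1] * f


# Hypothetical 3-block stack on a 640 x 128 Mel spectrogram.
seq_len, feat = crnn_sequence_shape(640, 128, channels=[32, 64, 128])
print(seq_len, feat)  # 80 2048
```

Under these assumed settings, the recurrent layer would read 80 time steps of 2,048 features each. A parallel CNN-RNN, by contrast, would feed the raw spectrogram to both branches and merge their outputs afterwards.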

Abstract

Automatic music genre classification is a long-standing challenge in Music Information Retrieval (MIR); work on non-Western music traditions remains scarce. Nepali music encompasses culturally rich and acoustically diverse genres--from the call-and-response duets of Lok Dohori to the rhythmic poetry of Deuda and the distinctive melodies of Tamang Selo--that have not been addressed by existing classification systems. In this paper, we construct a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres and conduct a systematic comparison of nine classification models across two paradigms. Five classical machine learning classifiers (Logistic Regression, SVM, KNN, Random Forest, and XGBoost) are trained on 51 hand-crafted audio features extracted via Librosa, while four deep learning architectures (CNN, RNN, parallel CNN-RNN, and sequential CNN followed by RNN) operate on Mel spectrograms of dimension 640 x 128. Our experiments reveal that the sequential Convolutional Recurrent Neural Network (CRNN)--in which convolutional layers feed into an LSTM--achieves the highest accuracy of 84%, substantially outperforming both the best classical models (Logistic Regression and XGBoost, both at 71%) and all other deep architectures. We provide per-class precision, recall, F1-score, confusion matrices, and ROC analysis for every model, and offer a culturally grounded interpretation of misclassification patterns that reflects genuine overlaps in Nepal's musical traditions.
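
The per-class precision, recall, and F1-score mentioned above can all be derived from a confusion matrix. The following is a minimal plain-Python sketch of that arithmetic, not the authors' evaluation code; the function names are hypothetical and the toy matrix uses made-up counts for three classes rather than results from the paper.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of clips of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # rest of row k
        fp = sum(confusion[i][k] for i in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(confusion):
    """Fraction of all clips that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


# Toy 3-class matrix with made-up counts (not results from the paper).
toy = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
for k, m in enumerate(per_class_metrics(toy)):
    print(k, round(m["precision"], 3), round(m["recall"], 3), round(m["f1"], 3))
print("accuracy:", accuracy(toy))
```

Off-diagonal cells are where the misclassification patterns show up: a large count at row i, column j means clips of genre i are often labeled as genre j.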

Computer Vision, Speech & Audio
