May 6, 2026, arXiv:2605.04998

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

AI Summary

This paper investigates how mixed-genre training data affects chord generation when a pop-pretrained Music Transformer is fine-tuned on jazz. The study systematically varies the amount of pop data retained during jazz fine-tuning to measure the trade-off between acquiring the new style and retaining the old one. Jazz-only fine-tuning lowers pop top-1 accuracy by 2.14 points; retaining approximately 2.5K pop samples (1.65x the jazz volume) restores it to the baseline, and larger pop volumes bring no further gain. In the author's informal listening, however, the metric-best run is not always the preferred one, and no formal listening study was conducted.

Key Contribution

Fine-tuning a chord generation model on a new genre requires only a modest amount of old-genre rehearsal data (about 2.5K pop samples, 1.65x the jazz volume) to offset forgetting of the old genre, but objective metrics do not always match the stylistic preferences the author reports from informal listening.

Abstract

Chord progression generation is practically important but understudied. Most large-scale symbolic music systems target melody, multi-track arrangement, or audio synthesis, and chord-only models tend to be relegated to conditioning components inside larger pipelines. This paper treats chord generation as a standalone task and addresses a question that arises whenever such a model is adapted across genres: how much old-domain data must be retained during fine-tuning to acquire a new domain without forgetting the old? I study jazz fine-tuning starting from a pop-pretrained 25M-parameter Music Transformer (84.24% top-1 chord accuracy on a held-out pop test set). The available jazz corpus is an order of magnitude smaller than the pop corpus, so every fine-tune run uses all 1,513 jazz training sequences. The swept variable is the volume of pop "rehearsal" data mixed alongside, taking values in {0, 1K, 2.5K, 5K, 10K}. Every fine-tuned model gains 7 to 9 points of jazz top-1. Pop accuracy collapses by 2.14 points under jazz-only fine-tuning, recovers to baseline at approximately 2.5K rehearsal samples (1.65x the jazz volume), and saturates beyond that point. A complementary observation: the metric-best run (F3, 2.5K mix) is not always the perceptually preferred one. The pop-leaning (10K) and jazz-leaning (1K) endpoints carry more committed stylistic identities that the author more often selects as finished output in informal listening. I discuss what this suggests for music co-creation tools but make no perceptual claim, since no formal listening study has been conducted. All six checkpoints are released on the HuggingFace Hub at https://huggingface.co/PearlLeeStudio.
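The rehearsal sweep described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's pipeline: the function names `mix_ratio` and `build_mixed_set` are hypothetical, "1K" is read as exactly 1,000 sequences, and the pop rehearsal subset is assumed to be drawn uniformly at random without replacement, which the abstract does not specify.

```python
import random

# Values taken from the abstract.
JAZZ_TRAIN_SIZE = 1513
POP_REHEARSAL_SIZES = [0, 1000, 2500, 5000, 10000]  # assumes "K" means x1000


def mix_ratio(pop_count, jazz_count=JAZZ_TRAIN_SIZE):
    """Pop rehearsal volume expressed as a multiple of the jazz volume."""
    return pop_count / jazz_count


def build_mixed_set(jazz_sequences, pop_sequences, pop_count, seed=0):
    """Combine all jazz sequences with a random pop subset, then shuffle.

    Hypothetical helper: uniform sampling without replacement is an
    assumption, not a detail confirmed by the paper.
    """
    if pop_count > len(pop_sequences):
        raise ValueError("not enough pop sequences for the requested volume")
    rng = random.Random(seed)
    rehearsal = rng.sample(pop_sequences, pop_count)
    mixed = list(jazz_sequences) + rehearsal
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    for n in POP_REHEARSAL_SIZES:
        print(f"{n:>6} pop samples: {mix_ratio(n):.2f}x the jazz volume")
```

With these numbers, the 2.5K setting works out to 2500 / 1513, roughly 1.65 times the jazz volume, which matches the ratio quoted in the abstract.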

Speech & Audio

Year: 2026

Venue: N/A
