Sapienza · May 5, 2026 · arXiv:2605.03929

PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione, Michele Mancusi, G. Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, E. Rodolà

AI Summary

PHALAR is a contrastive framework for stem retrieval that retains the temporal information existing models discard. It reports a relative accuracy increase of up to about 70% over the state of the art while using less than 50% of the parameters and training 7× faster. A Learned Spectral Pooling layer and a complex-valued head enforce pitch-equivariant and phase-equivariant biases. PHALAR sets new retrieval results on MoisesDB, Slakh, and ChocoChorales and correlates significantly more strongly with human coherence judgment than semantic baselines. Zero-shot beat tracking and linear chord probing indicate that it captures robust musical structures beyond the retrieval task.

Key Contribution

A phase-aware contrastive architecture raises stem retrieval accuracy by up to about 70% relative to the state of the art, with less than half the parameters and a 7× training speedup.

Abstract

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
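As a rough intuition for why a complex-valued (phasor) representation can keep timing information, the sketch below checks the standard DFT shift theorem: delaying a signal leaves each Fourier magnitude unchanged and rotates each coefficient's phase by a predictable amount. This is only an illustration of phase equivariance in general, not PHALAR's architecture or code; the toy signal, the shift value, and all function names are assumptions made for the example.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, O(n^2); fine for a short demo."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def circular_shift(signal, shift):
    """Delay the signal by `shift` samples, wrapping around at the end."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]


def rotate_phases(spectrum, shift):
    """Apply the per-bin phase rotation the shift theorem predicts for a delay."""
    n = len(spectrum)
    return [
        c * cmath.exp(-2j * math.pi * k * shift / n)
        for k, c in enumerate(spectrum)
    ]


# Hypothetical toy signal: two sinusoids over 16 samples (not from the paper).
N = 16
SHIFT = 3
signal = [
    math.sin(2 * math.pi * 2 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]

original_spectrum = dft(signal)
shifted_spectrum = dft(circular_shift(signal, SHIFT))
predicted_spectrum = rotate_phases(original_spectrum, SHIFT)

# Equivariance: shifting in time equals rotating phases in frequency.
max_error = max(abs(a - b) for a, b in zip(shifted_spectrum, predicted_spectrum))

# Magnitudes alone are unchanged by the shift, so they cannot encode it.
magnitudes_match = all(
    math.isclose(abs(a), abs(b), abs_tol=1e-9)
    for a, b in zip(original_spectrum, shifted_spectrum)
)
```

Here `max_error` stays at floating-point noise level and `magnitudes_match` is true. A representation that keeps only magnitudes therefore cannot distinguish the delayed signal from the original, while one that keeps the phasors can.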

Architecture Design (Transformers, SSMs, MoE) · Speech & Audio · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
