Jun 8, 2026 · arXiv:2606.10010

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

AI Summary

This paper introduces DeRA-MOS, an evaluation framework for text-to-music (TTM) systems that predicts music impression (MI) and text alignment (TA) scores using decoupled listwise ranking and modality alignment objectives. A batch-aware listwise ranking loss for MI and a score-anchored modality alignment loss for TA address the limitations of point-wise regression, bringing predictions closer to human evaluations. Experiments on MusicEval show improvements in both MI and TA ranking metrics, supporting the approach for large-scale TTM evaluation.

Key Contribution

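A decoupled optimization framework for text-to-music evaluation that pairs a batch-aware listwise ranking loss for MI with a score-anchored modality alignment loss for TA, improving rank agreement with human judgments on MusicEval compared with point-wise regression training.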

Abstract

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. Experiments on MusicEval demonstrate that, by mitigating the point-wise training mismatch and modality drift, our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
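To make the two objectives named in the abstract more concrete, the following is a minimal NumPy sketch of the general ideas, not the paper's actual formulation, which the abstract does not specify. It assumes a ListNet-style top-one cross-entropy as a stand-in for the batch-aware listwise ranking loss, and a mean squared error between audio-text cosine similarity and a linearly rescaled human score as a stand-in for the score-anchored modality alignment loss. The function names, the 1 to 5 MOS range, and the linear mapping to [0, 1] are all assumptions made for illustration.

```python
import numpy as np


def _log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())


def listwise_top1_loss(pred, mos):
    """ListNet-style top-one loss over one mini-batch.

    Hypothetical stand-in for a batch-aware listwise ranking loss:
    both the predicted and the human scores are turned into a
    distribution over "which item ranks first" within the batch,
    and the cross-entropy between the two is returned.
    """
    pred = np.asarray(pred, dtype=float)
    mos = np.asarray(mos, dtype=float)
    target = np.exp(_log_softmax(mos))
    return float(-(target * _log_softmax(pred)).sum())


def score_anchored_alignment_loss(audio_emb, text_emb, mos, lo=1.0, hi=5.0):
    """MSE between cosine similarity and a score-derived target.

    Hypothetical stand-in for a score-anchored alignment loss:
    each human score is mapped linearly from [lo, hi] to [0, 1]
    (an assumed mapping) and used as the target cosine similarity
    for the corresponding audio/text embedding pair.
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cos = (a * t).sum(axis=1)
    target = (np.asarray(mos, dtype=float) - lo) / (hi - lo)
    return float(np.mean((cos - target) ** 2))
```

In this sketch, the listwise loss depends on every other item in the mini-batch, so a prediction is penalized for being out of order relative to its neighbors rather than for its absolute error, which is the property the abstract ties to SRCC-based evaluation. The alignment loss is applied to the embeddings directly, which mirrors the abstract's description of regularizing the latent space before fusion.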

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio


Year: 2026

Venue: N/A
