This paper investigates the reliability of existing and novel automatic metrics, including LLM-based evaluations, for text style transfer (TST) across sentiment transfer and detoxification tasks in English, Hindi, and Bengali. Through meta-evaluation correlating automatic metrics with human judgments, the study identifies metrics that outperform existing TST metrics when used individually and especially in oracle ensembles. The results demonstrate that advanced NLP metrics and LLM-based evaluations offer improved insights into TST performance compared to traditional metrics.
Forget standard metrics: advanced NLP metrics and LLM-based evaluations outperform traditional measures for text style transfer in multilingual settings.
Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics, and our oracle ensemble approaches show even more potential.
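The meta-evaluation protocol described above, correlating each automatic metric's scores with human judgments and then (as an oracle) selecting the best-correlating metric, can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the metric names, scores, and human ratings below are hypothetical, and Spearman's rank correlation is used as one common choice of correlation statistic.

```python
def rank(values):
    """Assign 1-based average ranks to values, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-output scores: human judgments (e.g. a 1-5 scale)
# and two automatic metrics scored on the same five TST outputs.
human = [5, 4, 2, 1, 3]
metrics = {
    "metric_a": [0.9, 0.8, 0.3, 0.1, 0.5],  # tracks humans closely
    "metric_b": [0.2, 0.9, 0.4, 0.8, 0.1],  # tracks humans poorly
}

# Oracle selection: keep whichever metric correlates best with humans.
best = max(metrics, key=lambda name: spearman(human, metrics[name]))
print(best)                                  # prints "metric_a"
print(spearman(human, metrics["metric_a"]))  # prints 1.0
```

In practice the correlation would be computed per language, per subtask, and per evaluation dimension (style accuracy, content preservation, naturalness), with the oracle choosing a metric for each setting rather than globally.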