ZJU | May 27, 2026 | arXiv:2605.28618

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Changhao Pan, Rui Yang, Han Wang, Hankun Wang, Zhuanzhong Zhou, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao

AI Summary

The paper introduces SwanBench-Speech, a benchmark for evaluating long-form speech generation models across diverse scenarios such as dialog and expressive speech. It addresses two limitations of existing benchmarks: narrow scenario coverage, and metrics that overlook long-text factors such as consistency and coherence. The benchmark covers acoustic, semantic, and expressive challenges. Experiments with SwanBench-Speech show that current models still struggle in highly expressive scenarios and lag behind real recordings in consistency and hierarchy.

Key Contribution

Current speech generation models still fall short in maintaining consistency and capturing nuanced expressiveness when generating long-form speech, despite advances in high-fidelity synthesis.

Abstract

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and so fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustic, semantic, and expressive challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.

Eval Frameworks & Benchmarks | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 82

Year: 2026

Venue: N/A
