Lingguang Zhaxian Technology | Northwestern | Yutu Zhineng | Apr 20, 2026 | arXiv:2604.17958

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

Huakang Chen, Jingbin Hu, Liumeng Xue, Qirui Zhan, Wenhao Li, Guobin Ma, Hanke Xie, Dake Guo, Linhan Ma, Yuepeng Jiang, Bengu Wu, Ben Wu, Pengyuan Xie, Peng Xie, Chuan Xie, Qiang Zhang

AI Summary

MINT-Bench, a new benchmark for instruction-following TTS, addresses limitations in existing evaluations through a hierarchical taxonomy, a scalable data construction pipeline, and a hybrid evaluation protocol assessing content consistency, instruction following, and perceptual quality. Experiments across ten languages reveal that while commercial systems generally lead, open-source models can outperform them in specific languages like Chinese. The benchmark highlights compositional and paralinguistic controls as key challenges for current TTS systems.

Key Contribution

Open-source TTS models can beat commercial systems in specific languages, but current instruction-following TTS still struggles with complex instructions like nuanced paralinguistic controls.

Abstract

Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that the task remains far from solved: frontier commercial systems lead overall, while leading open-source models are highly competitive and can even outperform their commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation. The leaderboard and demo are available at https://longwaytog0.github.io/MINT-Bench/

Eval Frameworks & Benchmarks | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
