mm-webagent | Apr 9, 2026 | arXiv:2604.08540

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Zhening Xing, Yuqing Yang, Qi Dai, Lili Qiu, Chong Luo

AI Summary

The paper introduces AVGen-Bench, a new benchmark for Text-to-Audio-Video (T2AV) generation, designed to address the limitations of existing benchmarks that evaluate audio and video in isolation. It features 11 real-world categories of prompts and a multi-granular evaluation framework that combines specialist models with MLLMs to assess perceptual quality and semantic controllability. Experiments using AVGen-Bench reveal a significant gap between the aesthetic quality and semantic reliability of current T2AV models, particularly in areas like text rendering, speech coherence, physical reasoning, and musical pitch control.

Key Contribution

Today's best text-to-audio-video models may look and sound impressive, but they still struggle with basic physics, coherent speech, and even rendering text correctly.

Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

Eval Frameworks & Benchmarks | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
