The paper introduces AVGen-Bench, a benchmark for Text-to-Audio-Video (T2AV) generation designed to address a limitation of existing benchmarks, which largely evaluate audio and video in isolation. It features prompts across 11 real-world categories and a multi-granular evaluation framework that combines specialist models with Multimodal Large Language Models (MLLMs) to assess both perceptual quality and semantic controllability. Experiments on AVGen-Bench reveal a pronounced gap between the strong audio-visual aesthetics and the weak semantic reliability of current T2AV models, particularly in text rendering, speech coherence, physical reasoning, and musical pitch control.
Today's best text-to-audio-video models may look and sound impressive, but they still struggle with basic physics, coherent speech, and even rendering text correctly.
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.