The paper introduces SLVMEval, a synthetic benchmark for meta-evaluating text-to-video (T2V) evaluation systems on long videos of up to roughly 3 hours. SLVMEval builds synthetically degraded video pairs across 10 quality aspects, filtered by crowdsourcing so that each pair exhibits a clearly perceptible quality difference. Experiments show that existing evaluation systems fall short of human accuracy in assessing long-video quality, exposing significant weaknesses in current T2V evaluation methods.
Current text-to-long-video evaluation metrics cannot reliably assess video quality, falling short of human judgment in 9 of the 10 tested degradation aspects.
This paper proposes the synthetic long-video meta-evaluation (SLVMEval) benchmark for meta-evaluating text-to-video (T2V) evaluation systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement: whether these systems can accurately assess video quality in settings that are easy for humans to judge. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. We then employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess how reliably existing evaluation systems rank these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, whereas in 9 of the 10 aspects the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
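To make the protocol concrete, below is a minimal sketch of the pairwise meta-evaluation loop, assuming a generic `score_video(video_path, prompt)` interface for the evaluation system under test. The `VideoPair` structure, its field names, and the per-aspect bookkeeping are illustrative assumptions, not the paper's actual code or data format. A system is counted correct on a pair when it scores the high-quality video above its degraded counterpart; the resulting per-aspect accuracy is what would be compared against the reported 84.7%-96.8% human accuracy range.

```python
# Hypothetical sketch of the pairwise meta-evaluation protocol described in
# the abstract. `score_video` stands in for any T2V evaluation system under
# test; VideoPair and its fields are assumptions for illustration only.

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoPair:
    prompt: str        # text prompt shared by both videos in the pair
    high_quality: str  # path to the original (undegraded) source video
    low_quality: str   # path to the synthetically degraded counterpart
    aspect: str        # which of the 10 degradation aspects was applied


def pairwise_accuracy(
    pairs: List[VideoPair],
    score_video: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the system scores the high-quality video higher."""
    correct = sum(
        score_video(p.high_quality, p.prompt) > score_video(p.low_quality, p.prompt)
        for p in pairs
    )
    return correct / len(pairs)


def per_aspect_accuracy(
    pairs: List[VideoPair],
    score_video: Callable[[str, str], float],
) -> Dict[str, float]:
    """Accuracy per degradation aspect, for comparison against human accuracy."""
    buckets: Dict[str, List[VideoPair]] = defaultdict(list)
    for p in pairs:
        buckets[p.aspect].append(p)
    return {aspect: pairwise_accuracy(ps, score_video) for aspect, ps in buckets.items()}
```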