Kuaishou | NJU | Jun 7, 2026 | arXiv:2606.08415

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

AI Summary

This paper introduces CoVEBench, a novel benchmark designed to evaluate the performance of video editing models on complex, compositional instructions that reflect real-world user requests. By incorporating 416 curated source videos and 626 multi-point editing instructions, the benchmark assesses models based on both MLLM-judged compliance and automated video quality metrics. The findings reveal that existing models struggle significantly with compositional editing tasks, often failing to execute multiple edits accurately while maintaining content integrity.

Key Contribution

Current video editing models falter under the weight of complex user instructions, often omitting critical edits and introducing artifacts.

Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.
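To give a sense of the benchmark's granularity, the short sketch below derives average densities from the three counts stated in the abstract. The `compliance_rate` helper is a hypothetical illustration of how per-item checklist verdicts could be aggregated into one score; the abstract does not specify the paper's actual scoring formula, so treat that part as an assumption rather than the authors' method.

```python
# Dataset sizes as stated in the abstract.
NUM_VIDEOS = 416
NUM_INSTRUCTIONS = 626
NUM_CHECKLIST_ITEMS = 9990

# Average densities derived purely from the counts above.
instructions_per_video = NUM_INSTRUCTIONS / NUM_VIDEOS
items_per_instruction = NUM_CHECKLIST_ITEMS / NUM_INSTRUCTIONS


def compliance_rate(verdicts: list[bool]) -> float:
    """Hypothetical aggregation: fraction of checklist items judged satisfied.

    ASSUMPTION: each checklist item receives a pass/fail verdict from the
    judge; the paper's real scoring rule may differ.
    """
    if not verdicts:
        raise ValueError("at least one checklist verdict is required")
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    print(f"instructions per video: {instructions_per_video:.2f}")
    print(f"checklist items per instruction: {items_per_instruction:.2f}")
    # Toy verdicts for a single edited video (made-up values).
    print(f"example compliance: {compliance_rate([True, False, True, True]):.2f}")
```

On average this works out to about 1.5 instructions per source video and roughly 16 checklist items per instruction, which is consistent with the abstract's description of the checklist as fine-grained.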

Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
