This paper introduces AgenticVBench, a benchmark of 100 agentic tasks that assesses multimodal AI agents on real-world video post-production workflows. Built from production workflows contributed by 20 industry experts, the benchmark evaluates frontier vision-language models (VLMs) and finds that the best evaluated agent stack scores barely above 30%, far below human expert performance on the same tasks. The study also shows that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes.
Despite advances in AI, the best evaluated agent stack scores barely above 30% on real-world video post-production tasks, far below human experts.
Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real-world post-production workflow, constructed from real production workflows contributed by 20 industry experts averaging 6 years of professional experience. Tasks are paired with evaluation specifications that combine programmatic verifiers and expert rubrics. We evaluate frontier vision-language models (VLMs) with both vendor-native and open-source harnesses. The best evaluated agent stack barely crosses 30%, far below human expert performance on the same tasks. We further find that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes. AgenticVBench provides a foundation for diagnosing and improving both models and harnesses for agentic video production. Benchmark website: https://agenticvbench.com.