Korea U | Jun 10, 2026 | arXiv:2606.11838

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

Hyomin Kim, Junghye Kim, Joanie Hayoun Chung, Yoonjin Oh, Kyungjae Lee, Sungbin Lim, Sungwoong Kim

AI Summary

This paper introduces SG-PVR, a video reward model for text-to-video (T2V) generation that uses plan-and-verify reasoning grounded in spatio-temporal scene graphs. It decomposes the prompt into atomic claims and checks each one against both the video and a scene graph extracted from it, so that every judgment rests on explicit visual evidence. The authors report strong performance on semantic alignment, including fine-grained temporal semantics, and show that SG-PVR used as a test-time reranker improves compositional alignment in T2V generation.

Key Contribution

A video reward model that explicitly verifies every prompt condition against visual evidence from the video and a spatio-temporal scene graph, targeting fine-grained semantic alignment in text-to-video generation.

Abstract

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
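The plan-and-verify idea described in the abstract can be illustrated with a small symbolic sketch. This is not the authors' implementation: SG-PVR is a learned reward model that reasons over the video itself, whereas the toy below assumes the scene graph and the verification plan are already given and checks claims by simple lookup. All names here (`Relation`, `SceneGraph`, `Claim`, `verify`, `reward`, `rerank`), the four claim kinds, and the fraction-of-claims score are hypothetical choices made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A relation that holds over an inclusive frame interval (assumed encoding)."""
    subject: str
    predicate: str
    obj: str
    start: int
    end: int


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph: entities with attributes, plus timed relations."""
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)


@dataclass(frozen=True)
class Claim:
    """One atomic claim from the verification plan (hypothetical schema)."""
    kind: str  # "entity", "attribute", "relation", or "before"
    args: tuple


def find_relation(graph: SceneGraph, subject: str, predicate: str, obj: str):
    """Return the first matching relation, or None if the graph has none."""
    for r in graph.relations:
        if (r.subject, r.predicate, r.obj) == (subject, predicate, obj):
            return r
    return None


def verify(claim: Claim, graph: SceneGraph) -> bool:
    """Check a single claim against the scene graph only (no video access here)."""
    if claim.kind == "entity":
        (name,) = claim.args
        return name in graph.attributes
    if claim.kind == "attribute":
        name, attr = claim.args
        return attr in graph.attributes.get(name, set())
    if claim.kind == "relation":
        return find_relation(graph, *claim.args) is not None
    if claim.kind == "before":
        first, second = claim.args  # two (subject, predicate, obj) triples
        a = find_relation(graph, *first)
        b = find_relation(graph, *second)
        return a is not None and b is not None and a.end < b.start
    raise ValueError(f"unknown claim kind: {claim.kind}")


def reward(plan: list[Claim], graph: SceneGraph) -> float:
    """Fraction of plan claims that the graph supports (an assumed scoring rule)."""
    if not plan:
        return 0.0
    return sum(verify(c, graph) for c in plan) / len(plan)


def rerank(plan: list[Claim], candidates: dict[str, SceneGraph]) -> str:
    """Test-time reranking: pick the candidate video with the highest reward."""
    return max(candidates, key=lambda name: reward(plan, candidates[name]))


# Hypothetical prompt: "A red ball rolls off a table, then a dog catches it."
rolls = ("ball", "rolls_off", "table")
catches = ("dog", "catches", "ball")
plan = [
    Claim("entity", ("ball",)),
    Claim("attribute", ("ball", "red")),
    Claim("entity", ("dog",)),
    Claim("relation", rolls),
    Claim("relation", catches),
    Claim("before", (rolls, catches)),
]

# Candidate A matches the prompt; candidate B has the wrong color and event order.
graph_a = SceneGraph(
    attributes={"ball": {"red"}, "table": set(), "dog": set()},
    relations=[Relation(*rolls, 0, 10), Relation(*catches, 12, 20)],
)
graph_b = SceneGraph(
    attributes={"ball": {"blue"}, "table": set(), "dog": set()},
    relations=[Relation(*catches, 0, 5), Relation(*rolls, 8, 15)],
)

best = rerank(plan, {"a": graph_a, "b": graph_b})
```

In this toy setup, candidate A satisfies all six claims, while candidate B fails the color claim and the temporal-order claim, so `rerank` selects A. Because the plan enumerates every requirement, a video cannot score well by satisfying only the most salient part of the prompt, which is the weakness in existing reward models that the abstract points to.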

Multimodal Models, Reasoning & Chain-of-Thought, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
