This paper introduces SG-PVR, a video reward model for text-to-video (T2V) generation that uses plan-and-verify reasoning grounded in spatio-temporal scene graphs. SG-PVR checks each condition in the prompt against both the video and a scene graph that serves as a structured visual reference, so the evidence behind each judgment is explicit. The paper reports strong performance on semantic alignment, including fine-grained temporal semantics, and shows that the model can also act as a test-time reranker that improves compositional alignment in T2V generation.
A reward model that explicitly verifies every prompt condition against visual evidence can improve fine-grained semantic alignment in text-to-video generation.
Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
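The plan-and-verify idea described above can be illustrated with a small symbolic sketch. This is not the paper's implementation: SG-PVR is a learned reward model that extracts the scene graph from the video and verifies claims against both the video and the graph, whereas the toy below checks hand-written claims against hand-written graphs only. All names here (`SceneGraph`, `Relation`, `Claim`, `verify`, `score`, `rerank`), the four claim kinds, and the fraction-of-claims score are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Relation:
    """A relation that holds over a time span (seconds). Hypothetical schema."""
    triple: Triple
    start: float
    end: float


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph: entities, attributes, timed relations."""
    entities: Set[str]
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def spans(self, triple: Triple) -> List[Relation]:
        # All time spans during which the given relation holds.
        return [r for r in self.relations if r.triple == triple]


@dataclass(frozen=True)
class Claim:
    """One atomic claim from the verification plan (claim kinds are assumed)."""
    kind: str  # "entity" | "attribute" | "relation" | "before"
    args: tuple


def verify(claim: Claim, graph: SceneGraph) -> bool:
    """Check a single claim against the scene graph only (no video access)."""
    if claim.kind == "entity":
        return claim.args[0] in graph.entities
    if claim.kind == "attribute":
        entity, attr = claim.args
        return attr in graph.attributes.get(entity, set())
    if claim.kind == "relation":
        return bool(graph.spans(claim.args))
    if claim.kind == "before":
        first, second = claim.args
        # True if some span of `first` ends before some span of `second` starts.
        return any(a.end <= b.start
                   for a in graph.spans(first) for b in graph.spans(second))
    raise ValueError(f"unknown claim kind: {claim.kind}")


def score(plan: List[Claim], graph: SceneGraph) -> float:
    """Fraction of claims that verify; a stand-in for a learned reward."""
    if not plan:
        return 0.0
    return sum(verify(c, graph) for c in plan) / len(plan)


def rerank(plan: List[Claim],
           candidates: Dict[str, SceneGraph]) -> List[Tuple[str, float]]:
    """Order candidate videos by score, best first (test-time reranking)."""
    scored = [(name, score(plan, g)) for name, g in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Prompt: "A red ball rolls onto a table, then a dog picks up the ball."
on_table: Triple = ("ball", "on", "table")
dog_holds: Triple = ("dog", "holds", "ball")
plan = [
    Claim("entity", ("ball",)),
    Claim("entity", ("dog",)),
    Claim("attribute", ("ball", "red")),
    Claim("relation", on_table),
    Claim("relation", dog_holds),
    Claim("before", (on_table, dog_holds)),
]

# Two hypothetical candidates: video_b has the wrong color and wrong order.
video_a = SceneGraph({"ball", "dog", "table"}, {"ball": {"red"}},
                     [Relation(on_table, 0.0, 2.0), Relation(dog_holds, 2.5, 4.0)])
video_b = SceneGraph({"ball", "dog", "table"}, {"ball": {"blue"}},
                     [Relation(dog_holds, 0.0, 1.5), Relation(on_table, 2.0, 4.0)])

ranking = rerank(plan, {"video_b": video_b, "video_a": video_a})
```

In this toy setup `video_a` satisfies all six claims while `video_b` fails the attribute and ordering claims, so `video_a` ranks first. Because every requirement in the prompt becomes an explicit claim, a failed judgment can be traced to a specific missing or misordered element of the graph, which is the property the abstract attributes to plan-and-verify reasoning.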