The paper introduces VQQA, a multi-agent framework that uses Vision-Language Model (VLM) critiques as semantic gradients to optimize video generation prompts. VQQA dynamically generates visual questions about the video and uses the VLM's answers as actionable feedback, enabling efficient closed-loop prompt optimization through a black-box natural language interface. Experiments on text-to-video and image-to-video tasks show that VQQA substantially improves generation quality, gaining +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation and outperforming existing stochastic search and prompt optimization methods.
Forget slow, white-box optimization: VQQA uses a clever question-answering agent to steer video generation models toward user intent through a simple text interface.
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
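The closed loop described in the abstract — generate a video, ask dynamically derived visual questions, treat the VLM's critiques as semantic gradients, and rewrite the prompt — can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `generate_video`, `vlm_answer`, and `revise_prompt` are hypothetical stubs standing in for the black-box video generator, the VLM critic, and the LLM prompt rewriter, and the toy pass/fail check is invented for demonstration.

```python
# Hypothetical sketch of a VQQA-style closed loop. All model calls are stubbed;
# a real system would invoke a T2V/I2V generator, a VLM, and an LLM here.

def generate_video(prompt):
    # Stub: a real implementation would call a black-box video generator.
    return {"prompt": prompt}

def ask_questions(prompt):
    # Dynamically derive visual questions from the user intent (stubbed:
    # one question per comma-separated clause of the prompt).
    return [f"Does the video show: {clause.strip()}?" for clause in prompt.split(",")]

def vlm_answer(video, question):
    # Stub VLM critique: "yes" on pass, otherwise a natural-language failure
    # description. The check on the word "red" is an arbitrary toy condition.
    return "yes" if "red" in video["prompt"] else f"No. Missing: {question}"

def revise_prompt(prompt, critiques):
    # Fold failed checks back into the prompt (stands in for an LLM rewrite
    # that uses the critiques as semantic gradients).
    missing = "; ".join(c for c in critiques if not c.lower().startswith("yes"))
    return prompt + f" (emphasize: {missing})" if missing else prompt

def vqqa_loop(prompt, max_steps=3):
    # Closed-loop refinement: stop early once every visual question passes.
    video = generate_video(prompt)
    for _ in range(max_steps):
        critiques = [vlm_answer(video, q) for q in ask_questions(prompt)]
        if all(c.lower().startswith("yes") for c in critiques):
            break  # all checks pass: output aligned with user intent
        prompt = revise_prompt(prompt, critiques)
        video = generate_video(prompt)
    return prompt, video
```

The key design point the abstract emphasizes is that the entire loop runs through a natural language interface: no gradients or model internals are needed, only the generator's output and the VLM's textual answers.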