$\bar{r}=\frac{1}{B}\sum_{b}\tilde{r}^{b}$: $\tilde{A}^{b}=\tilde{r}^{b}-\bar{r}$ and $A^{b}=r^{b}-\bar{r}$. We use only recentering (no standard-deviation normalization) to avoid the difficulty bias induced by small reward variance [22].

Adversarial online discriminator update. In unverifiable generation tasks, the generator distribution $p_{\theta}$ inevitably drifts from the initial training support, rendering static rewards unreliable and susceptible to hacking. To mitigate this, we employ an iterative adversarial update strategy: periodically, we refresh the discriminator by continuing to optimize Eq. (7), using the most recent on-policy samples $\tilde{x}_{0}^{b}$ as negative examples against the ground-truth videos. In this framework, the reward approximates the log-density ratio $\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}$ in feature space, ensuring the signal remains accurate and robust to distribution shift throughout training. The complete SHIFT procedure is outlined in Algorithm 1.

Remark. While SHIFT's loss structure (Eq. (12)) bears a superficial resemblance to the reward-weighted regression plus SFT objective of Lee et al. [17], the two methods differ in several fundamental respects. First, SHIFT derives its objective from a principled forward-KL formulation (Eq. (9)), whereas Lee et al. adopt a heuristic combination of RWR and SFT without such theoretical grounding. Second, SHIFT replaces raw rewards with group-relative advantages and employs adversarial reward-model co-training to prevent reward hacking, mechanisms absent in Lee et al. Third, our reward models are fully automatic, motion-aware discriminators trained without any human annotation; this is essential for video motion alignment, where pixel-level temporal dynamics are practically infeasible for humans to label at scale.
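The group-relative advantage above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and NumPy usage are our own:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Recenter a group of B rewards by their mean (no std normalization).

    `rewards` holds the scores r^b for B rollouts of the same conditioning
    input; the advantage is A^b = r^b - r_bar, with r_bar = (1/B) sum_b r^b.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()   # r_bar = (1/B) * sum_b r^b
    return rewards - baseline   # A^b = r^b - r_bar
```

Note that, unlike the usual GRPO-style normalization, the advantages are not divided by the group standard deviation: for groups whose rewards are nearly identical, that division would amplify noise into large advantages, which is the difficulty bias [22] the recentering-only scheme avoids.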
Finally, although SHIFT alternates between generator and discriminator updates, it differs fundamentally from standard GAN training: the discriminator provides only a scalar reward signal used to compute advantages, and its gradient is never backpropagated into the generator. As shown by Pfau and Vinyals [28], GANs can be viewed as actor-critic methods where coupled gradient-based min-max optimization is the root cause of training instability; SHIFT sidesteps this by fully decoupling the two optimizations, retaining the discriminator's adaptive distribution-awareness without inheriting adversarial gradient dynamics.

5 Experiments

Implementation. We validate SHIFT on two video diffusion models: SVD ($\sim$1.
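The log-density-ratio interpretation of the discriminator reward can be verified on a toy example. For the Bayes-optimal discriminator $D(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{\theta}(x)}$, the logit satisfies $\operatorname{logit}(D(x))=\log\frac{p_{\text{data}}(x)}{p_{\theta}(x)}$, which is exactly the scalar reward used above. The sketch below (our own illustration, not the paper's code) checks this for two unit-variance Gaussians, where the log-ratio is linear in $x$:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """Density of a 1-D Gaussian N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def optimal_disc_logit(x, mu_data=0.0, mu_model=1.0):
    """Logit of the Bayes-optimal discriminator between p_data and p_model.

    For p_data = N(0,1), p_model = N(1,1) the closed form is 0.5 - x.
    """
    p_d = gauss_pdf(x, mu_data)
    p_m = gauss_pdf(x, mu_model)
    d = p_d / (p_d + p_m)         # optimal discriminator output D(x)
    return math.log(d / (1 - d))  # logit(D(x)) = log(p_data(x) / p_model(x))
```

In SHIFT this quantity is consumed only as a number when computing advantages; no gradient flows through it back into the generator, which is what distinguishes the scheme from the coupled min-max dynamics of GAN training.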