$\bar{r}=\frac{1}{B}\sum_{b}\tilde{r}^{b}$: $\tilde{A}^{b}=\tilde{r}^{b}-\bar{r}$ and $A^{b}=r^{b}-\bar{r}$. We use only recentering (no standard-deviation normalization) to avoid the difficulty bias induced by small reward variance [22].

Adversarial online discriminator update. In unverifiable generation tasks, the generator distribution $p_{\theta}$ inevitably drifts from the initial training support, rendering static rewards unreliable and susceptible to hacking. To mitigate this, we employ an iterative adversarial update strategy: periodically, we refresh the discriminator by continuing to optimize Eq. (7), using the most recent on-policy samples $\tilde{x}_{0}^{b}$ as negative examples against the ground-truth videos. In this framework, the reward approximates the log-density ratio $\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}$ in feature space, ensuring the signal remains accurate and robust to distribution shift throughout training. The complete SHIFT procedure is outlined in Algorithm 1.

Remark. While SHIFT's loss structure (Eq. (12)) bears a superficial resemblance to the reward-weighted regression plus SFT objective of Lee et al. [17], the two methods differ in several fundamental respects. First, SHIFT derives its objective from a principled forward-KL formulation (Eq. (9)), whereas Lee et al. adopt a heuristic combination of RWR and SFT without such theoretical grounding. Second, SHIFT replaces raw rewards with group-relative advantages and employs adversarial reward-model co-training to prevent reward hacking, mechanisms absent in Lee et al. Third, our reward models are fully automatic, motion-aware discriminators trained without any human annotation; this is essential for video motion alignment, where pixel-level temporal dynamics are practically infeasible for humans to label at scale.
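The group-relative advantage above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and NumPy usage are our own:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Recenter a group of B rewards by their mean (no std normalization).

    `rewards` holds the scores r^b for B rollouts of the same conditioning
    input; the advantage is A^b = r^b - r_bar, with r_bar = (1/B) sum_b r^b.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()   # r_bar = (1/B) * sum_b r^b
    return rewards - baseline   # A^b = r^b - r_bar
```

Note that, unlike the usual GRPO-style normalization, the advantages are not divided by the group standard deviation: for groups whose rewards are nearly identical, that division would amplify noise into large advantages, which is the difficulty bias [22] the recentering-only scheme avoids.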
Finally, although SHIFT alternates between generator and discriminator updates, it differs fundamentally from standard GAN training: the discriminator provides only a scalar reward signal used to compute advantages, and its gradient is never backpropagated into the generator. As shown by Pfau and Vinyals [28], GANs can be viewed as actor-critic methods where coupled gradient-based min-max optimization is the root cause of training instability; SHIFT sidesteps this by fully decoupling the two optimizations, retaining the discriminator's adaptive distribution-awareness without inheriting adversarial gradient dynamics.

5 Experiments

Implementation. We validate SHIFT on two video diffusion models: SVD ($\sim$1.
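The log-density-ratio interpretation of the discriminator reward can be verified on a toy example. For the Bayes-optimal discriminator $D(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{\theta}(x)}$, the logit satisfies $\operatorname{logit}(D(x))=\log\frac{p_{\text{data}}(x)}{p_{\theta}(x)}$, which is exactly the scalar reward used above. The sketch below (our own illustration, not the paper's code) checks this for two unit-variance Gaussians, where the log-ratio is linear in $x$:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """Density of a 1-D Gaussian N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def optimal_disc_logit(x, mu_data=0.0, mu_model=1.0):
    """Logit of the Bayes-optimal discriminator between p_data and p_model.

    For p_data = N(0,1), p_model = N(1,1) the closed form is 0.5 - x.
    """
    p_d = gauss_pdf(x, mu_data)
    p_m = gauss_pdf(x, mu_model)
    d = p_d / (p_d + p_m)         # optimal discriminator output D(x)
    return math.log(d / (1 - d))  # logit(D(x)) = log(p_data(x) / p_model(x))
```

In SHIFT this quantity is consumed only as a number when computing advantages; no gradient flows through it back into the generator, which is what distinguishes the scheme from the coupled min-max dynamics of GAN training.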