Given the group mean $\bar{r}=\frac{1}{B}\sum_{b}\tilde{r}^{b}$, the advantages are $\tilde{A}^{b}=\tilde{r}^{b}-\bar{r}$ and $A^{b}=r^{b}-\bar{r}$. We use only recentering (no standard-deviation normalization) to avoid the difficulty bias induced by small reward variance [22].

Adversarial online discriminator update. In unverifiable generation tasks, the generator distribution $p_{\theta}$ inevitably drifts from its initial training support, rendering static rewards unreliable and susceptible to hacking. To mitigate this, we employ an iterative adversarial update strategy: periodically, we refresh the discriminator by continuing to optimize Eq. (7), using the most recent on-policy samples $\tilde{x}_{0}^{b}$ as negative examples against the ground-truth videos. In this framework, the reward approximates the log-density ratio $\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}$ in feature space, keeping the signal accurate and robust to distribution shift throughout training. The complete SHIFT procedure is outlined in Algorithm 1.

Remark. While SHIFT's loss structure (Eq. (12)) bears a superficial resemblance to the reward-weighted regression plus SFT objective of Lee et al. [17], the two methods differ in several fundamental respects. First, SHIFT derives its objective from a principled forward-KL formulation (Eq. (9)), whereas Lee et al. adopt a heuristic combination of RWR and SFT without such theoretical grounding. Second, SHIFT replaces raw rewards with group-relative advantages and employs adversarial reward-model co-training to prevent reward hacking, mechanisms absent in Lee et al. Third, our reward models are fully automatic, motion-aware discriminators trained without any human annotation; this is essential for video motion alignment, where pixel-level temporal dynamics are practically infeasible for humans to label at scale.
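The two mechanisms above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the linear `Discriminator` class, the feature dimension, the learning rate, and the Gaussian "features" are all hypothetical stand-ins for the motion-aware discriminators operating on video features. It shows (i) mean-only recentering of group rewards, (ii) a periodic discriminator refresh on real features versus recent on-policy samples, and (iii) the reward used purely as a scalar, with no gradient path back to the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_relative_advantages(rewards):
    """Recenter rewards within a group of B rollouts.

    Only the group mean is subtracted; there is no division by the
    standard deviation, matching the paper's choice to avoid the
    difficulty bias induced by small reward variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

class Discriminator:
    """Hypothetical logistic discriminator on pooled feature vectors."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def logit(self, x):
        # At the BCE optimum this logit approximates the log-density
        # ratio log p_data(x) / p_theta(x) in feature space.
        return x @ self.w + self.b

    def reward(self, x):
        # Used only as a scalar reward for advantage computation:
        # no gradient is ever backpropagated into the generator.
        return float(self.logit(x))

    def refresh(self, real_feats, fake_feats, steps=100):
        """Periodic adversarial update: ground-truth features as
        positives, the most recent on-policy samples as negatives."""
        X = np.vstack([real_feats, fake_feats])
        y = np.concatenate([np.ones(len(real_feats)),
                            np.zeros(len(fake_feats))])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-self.logit(X)))  # sigmoid
            g = p - y                                  # dBCE/dlogit
            self.w -= self.lr * (X.T @ g) / len(y)
            self.b -= self.lr * g.mean()

# Usage: refresh on real vs. on-policy features, then score a group.
disc = Discriminator(dim=4)
real = rng.normal(1.0, 1.0, size=(64, 4))  # "ground-truth" features
fake = rng.normal(0.0, 1.0, size=(64, 4))  # on-policy sample features
disc.refresh(real, fake)

group = rng.normal(0.0, 1.0, size=(8, 4))  # B = 8 on-policy rollouts
adv = group_relative_advantages([disc.reward(x) for x in group])
assert abs(adv.mean()) < 1e-8              # recentered to zero mean
```

Because the discriminator only emits scalars here, the generator update sees fixed advantage weights, which is precisely how SHIFT avoids the coupled min-max gradient dynamics of GAN training.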
Finally, although SHIFT alternates between generator and discriminator updates, it differs fundamentally from standard GAN training: the discriminator provides only a scalar reward signal used to compute advantages, and its gradient is never backpropagated into the generator. As shown by Pfau and Vinyals [28], GANs can be viewed as actor-critic methods in which coupled gradient-based min-max optimization is the root cause of training instability; SHIFT sidesteps this by fully decoupling the two optimizations, retaining the discriminator's adaptive distribution-awareness without inheriting adversarial gradient dynamics.

5 Experiments

Implementation. We validate SHIFT on two video diffusion models: SVD (∼1.

[Table: results for the SVD model. FVD is tested on the DAVIS2017 validation set. Higher (↑) is better; lower (↓) is better. Best is bold; second-best is underlined. Columns include VBench Appearance (↑).]

Zhaotong Yang is with the School of Computer Science and Technology, Ocean University of China, Qingdao, China, and the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: ztyang808@njust.edu.cn.
Yong Du is with the School of Computer Science and Technology, Ocean University of China, Qingdao, China, and with the Sanya Oceanographic Institution, Ocean University of China, Sanya, China. E-mail: csyongdu@ouc.edu.cn.
Shengfeng He is with the School of Computing and Information Systems, Singapore Management University, Singapore. E-mail: shengfenghe@smu.edu.sg.
Yuhui Li, Xinzhe Li, and Junyu Dong are with the School of Computer Science and Technology, Ocean University of China, Qingdao, China. E-mail: liyuhui1150@stu.ouc.edu.cn, lixinzhe@stu.ouc.edu.cn, dongjunyu@ouc.edu.cn.
Yangyang Xu is with the School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China. E-mail: xuyangyang@hit.edu.cn.
Jian Yang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: csjyang@njust.edu.cn.