Given the group mean $\bar{r}=\frac{1}{B}\sum_{b}\tilde{r}^{b}$, the advantages are $\tilde{A}^{b}=\tilde{r}^{b}-\bar{r}$ and $A^{b}=r^{b}-\bar{r}$. We use only recentering (no standard-deviation normalization) to avoid the difficulty bias induced by small reward variance [22].

Adversarial online discriminator update. In unverifiable generation tasks, the generator distribution $p_{\theta}$ inevitably drifts from its initial training support, rendering static rewards unreliable and susceptible to hacking. To mitigate this, we employ an iterative adversarial update strategy: periodically, we refresh the discriminator by continuing to optimize Eq. (7), using the most recent on-policy samples $\tilde{x}_{0}^{b}$ as negative examples against the ground-truth videos. In this framework, the reward approximates the log-density ratio $\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\theta}(\mathbf{x})}$ in feature space, keeping the signal accurate and robust to distribution shift throughout training. The complete SHIFT procedure is outlined in Algorithm 1.

Remark. While SHIFT's loss structure (Eq. (12)) bears a superficial resemblance to the reward-weighted regression plus SFT objective of Lee et al. [17], the two methods differ in several fundamental respects. First, SHIFT derives its objective from a principled forward-KL formulation (Eq. (9)), whereas Lee et al. adopt a heuristic combination of RWR and SFT without such theoretical grounding. Second, SHIFT replaces raw rewards with group-relative advantages and employs adversarial reward-model co-training to prevent reward hacking, mechanisms absent in Lee et al. Third, our reward models are fully automatic, motion-aware discriminators trained without any human annotation; this is essential for video motion alignment, where pixel-level temporal dynamics are practically infeasible for humans to label at scale.
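The two mechanisms above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the linear `Discriminator` class, the feature dimension, the learning rate, and the Gaussian "features" are all hypothetical stand-ins for the motion-aware discriminators operating on video features. It shows (i) mean-only recentering of group rewards, (ii) a periodic discriminator refresh on real features versus recent on-policy samples, and (iii) the reward used purely as a scalar, with no gradient path back to the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_relative_advantages(rewards):
    """Recenter rewards within a group of B rollouts.

    Only the group mean is subtracted; there is no division by the
    standard deviation, matching the paper's choice to avoid the
    difficulty bias induced by small reward variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

class Discriminator:
    """Hypothetical logistic discriminator on pooled feature vectors."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def logit(self, x):
        # At the BCE optimum this logit approximates the log-density
        # ratio log p_data(x) / p_theta(x) in feature space.
        return x @ self.w + self.b

    def reward(self, x):
        # Used only as a scalar reward for advantage computation:
        # no gradient is ever backpropagated into the generator.
        return float(self.logit(x))

    def refresh(self, real_feats, fake_feats, steps=100):
        """Periodic adversarial update: ground-truth features as
        positives, the most recent on-policy samples as negatives."""
        X = np.vstack([real_feats, fake_feats])
        y = np.concatenate([np.ones(len(real_feats)),
                            np.zeros(len(fake_feats))])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-self.logit(X)))  # sigmoid
            g = p - y                                  # dBCE/dlogit
            self.w -= self.lr * (X.T @ g) / len(y)
            self.b -= self.lr * g.mean()

# Usage: refresh on real vs. on-policy features, then score a group.
disc = Discriminator(dim=4)
real = rng.normal(1.0, 1.0, size=(64, 4))  # "ground-truth" features
fake = rng.normal(0.0, 1.0, size=(64, 4))  # on-policy sample features
disc.refresh(real, fake)

group = rng.normal(0.0, 1.0, size=(8, 4))  # B = 8 on-policy rollouts
adv = group_relative_advantages([disc.reward(x) for x in group])
assert abs(adv.mean()) < 1e-8              # recentered to zero mean
```

Because the discriminator only emits scalars here, the generator update sees fixed advantage weights, which is precisely how SHIFT avoids the coupled min-max gradient dynamics of GAN training.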
Finally, although SHIFT alternates between generator and discriminator updates, it differs fundamentally from standard GAN training: the discriminator provides only a scalar reward signal used to compute advantages, and its gradient is never backpropagated into the generator. As shown by Pfau and Vinyals [28], GANs can be viewed as actor-critic methods in which coupled gradient-based min-max optimization is the root cause of training instability; SHIFT sidesteps this by fully decoupling the two optimizations, retaining the discriminator's adaptive distribution-awareness without inheriting adversarial gradient dynamics.

5 Experiments

Implementation. We validate SHIFT on two video diffusion models: SVD (∼1.

[Table: results for the SVD model. FVD is tested on the DAVIS2017 validation set. Higher (↑) is better; lower (↓) is better. Best is bold; second-best is underlined. Columns include VBench Appearance (↑).]

Zhaotong Yang is with the School of Computer Science and Technology, Ocean University of China, Qingdao, China, and the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: ztyang808@njust.edu.cn.
Yong Du is with the School of Computer Science and Technology, Ocean University of China, Qingdao, China, and with the Sanya Oceanographic Institution, Ocean University of China, Sanya, China. E-mail: csyongdu@ouc.edu.cn.
Shengfeng He is with the School of Computing and Information Systems, Singapore Management University, Singapore. E-mail: shengfenghe@smu.edu.sg.
Yuhui Li, Xinzhe Li, and Junyu Dong are with the School of Computer Science and Technology, Ocean University of China, Qingdao, China. E-mail: liyuhui1150@stu.ouc.edu.cn, lixinzhe@stu.ouc.edu.cn, dongjunyu@ouc.edu.cn.
Yangyang Xu is with the School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China. E-mail: xuyangyang@hit.edu.cn.
Jian Yang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: csjyang@njust.edu.cn.