$\mathbf{X}\in\mathbb{R}^{T\times 3\times H\times W}$. Unlike fixed grid sampling, this stochastic query initialization acts as an implicit ensemble over motion constraints, enhancing robustness against reward hacking. We treat the evolution of pixel trajectories as a temporal stochastic process. To incorporate this motion-based inductive bias, we explicitly model the drift component via the instantaneous velocity of tracked points, $\tau_{d}[t]=\tau[t]-\tau[t-1]$. Trajectories are estimated using CoTracker3 [16], which additionally provides visibility masks $\tau_{v}$ and confidence scores $\tau_{c}$. These components are concatenated to form a motion state descriptor $\mathbf{s}=[\tau_{d},\tau_{v},\tau_{c}]$. To further capture global structural dependencies, we augment this descriptor with dense correlation maps representing spatial affinities between query points and their temporal neighbors.

Reward Training. Both the instantaneous and long-term reward models are parameterized as Vision Transformers (ViTs) and trained as binary discriminators that distinguish real videos drawn from the dataset $p_{\text{data}}$ from synthetic samples generated by the diffusion model. Formally, we optimize the discriminator parameters $\omega$ by minimizing the binary cross-entropy loss:

$$\mathcal{L}_{\text{rew}}(\omega)=-\mathbb{E}_{x\sim p_{\text{data}}}[\log D_{\omega}(x)]-\mathbb{E}_{\tilde{x}\sim p_{\theta}}[\log(1-D_{\omega}(\tilde{x}))]. \quad (7)$$

The reward $r(\mathbf{x})$ is defined as the raw logit output of the discriminator, $r(\mathbf{x})=\operatorname{logit}(D_{\omega}(\mathbf{x}))$, clipped to a stable range $[r_{\min},r_{\max}]$. We explicitly opt for raw logits over sigmoid-transformed probabilities to avoid gradient saturation, ensuring stronger learning signals particularly when the discriminator is confident (see ablation study).
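As a concrete sketch of the descriptor construction described above (array shapes, the zero-padding at $t=0$, and the helper name are our own illustrative choices, not taken from the paper):

```python
import numpy as np

def motion_state_descriptor(tau, tau_v, tau_c):
    """Build the motion state descriptor s = [tau_d, tau_v, tau_c].

    tau:   (T, N, 2) tracked point coordinates over T frames for N queries
    tau_v: (T, N)    visibility mask in {0, 1}
    tau_c: (T, N)    per-point confidence scores
    Returns a (T, N, 4) descriptor per frame and query point.
    """
    # Instantaneous velocity tau_d[t] = tau[t] - tau[t-1]; pad t = 0 with zeros
    # (an assumption: the paper does not specify how the first frame is handled).
    tau_d = np.zeros_like(tau)
    tau_d[1:] = tau[1:] - tau[:-1]
    # Concatenate drift (2 channels), visibility (1), and confidence (1).
    return np.concatenate([tau_d, tau_v[..., None], tau_c[..., None]], axis=-1)
```

The dense correlation maps mentioned in the text would be appended as additional channels in the same way.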
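The discriminator loss of Eq. (7) and the clipped-logit reward can be sketched as follows (a minimal numpy sketch operating on precomputed logits; function names and the clipping range are illustrative assumptions):

```python
import numpy as np

def bce_reward_loss(logits_real, logits_fake):
    """Binary cross-entropy for the reward discriminator, Eq. (7).

    logits_real: D_omega logits on real videos x ~ p_data
    logits_fake: D_omega logits on generated videos x~ ~ p_theta
    """
    # Numerically stable log-sigmoid: log sigma(z) = -log(1 + e^{-z}),
    # and log(1 - sigma(z)) = log sigma(-z).
    log_sig = lambda z: -np.logaddexp(0.0, -z)
    return -(log_sig(logits_real).mean() + log_sig(-logits_fake).mean())

def reward(logits, r_min=-5.0, r_max=5.0):
    """Reward = raw discriminator logit, clipped to [r_min, r_max].

    Raw logits avoid the gradient saturation of sigmoid probabilities;
    the clipping bounds here are placeholders, not the paper's values.
    """
    return np.clip(logits, r_min, r_max)
```

Using the raw logit keeps the reward gradient non-vanishing even when $D_\omega(\mathbf{x})$ is close to 0 or 1, which is exactly the regime where a sigmoid-transformed reward would saturate.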
4.2 SHIFT: Smooth Hybrid Fine-tuning

To address the challenges discussed in Section 3, we present SHIFT (Smooth HybrId Fine-Tuning), illustrated in Figure 1(b), a data-regularized reinforcement learning framework designed for efficient video diffusion alignment. Instead of constraining the model to a drifting reference policy via a reverse KL term, we anchor the model to the stationary ground-truth data distribution $p_{\text{data}}$ via a forward KL term. This is particularly well suited to motion alignment, where we assume ground-truth video data possesses perfect motion fidelity.

Deriving the Forward-KL Target. Replacing the reference policy in Eq. (2) with the stationary data distribution, we define the optimal target $p^{*}(x)$ as:

$$p^{*}(x)\propto p_{\text{data}}(x)\exp\!\left(\frac{r(x)}{\beta}\right). \quad (8)$$

While minimizing the reverse KL between this target and $p_{\theta}$ is intractable due to the unnormalized density of $p^{*}$, minimizing the forward KL divergence is straightforward:

$$\mathcal{J}(\theta)=D_{\text{KL}}\big(p^{*}(x)\,\|\,p_{\theta}(x)\big), \qquad \arg\min_{\theta}\mathcal{J}(\theta)=\arg\max_{\theta}\,\mathbb{E}_{x\sim p^{*}}[\log p_{\theta}(x)]. \quad (9)$$

Optimization via Diffusion Loss Proxy. Evaluating $\mathbb{E}_{x\sim p^{*}}[\dots]$ is non-trivial because we cannot sample directly from $p^{*}$. However, since $p^{*}$ is the product of the data distribution and an exponentiated reward, its probability mass concentrates in two regions: (1) the support of $p_{\text{data}}$, and (2) regions of high reward $r(x)$. We therefore approximate the gradient $\mathbb{E}_{x\sim p^{*}}[\nabla_{\theta}\log p_{\theta}(x)]$ with a mixture estimator over two accessible sources.
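The equivalence in Eq. (9) follows from the standard decomposition of the forward KL into a $\theta$-independent term and a cross-entropy term:

```latex
D_{\text{KL}}\big(p^{*}(x)\,\|\,p_{\theta}(x)\big)
  = \underbrace{\mathbb{E}_{x\sim p^{*}}[\log p^{*}(x)]}_{\text{constant in }\theta}
  \;-\; \mathbb{E}_{x\sim p^{*}}[\log p_{\theta}(x)],
```

so minimizing $\mathcal{J}(\theta)$ over $\theta$ is equivalent to maximizing the expected log-likelihood $\mathbb{E}_{x\sim p^{*}}[\log p_{\theta}(x)]$; the unknown normalizer of $p^{*}$ only affects the constant term.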
An offline anchor ($x\sim p_{\text{data}}$) samples directly from the dataset to preserve the generation quality and motion coherence of the base distribution, while an online exploration term ($\tilde{x}\sim p_{\theta}$) uses the model's own rollouts, weighted by their relative advantage $A(\tilde{x})$, to cover the high-reward regions favored by $\exp(r(\tilde{x})/\beta)$. This yields:

$$\mathcal{J}(\theta)\approx\underbrace{\mathbb{E}_{\tilde{x}\sim p_{\theta}}\left[\exp\!\left(\frac{A(\tilde{x})}{\beta}\right)\log p_{\theta}(\tilde{x})\right]}_{\text{Online Exploration}}+\underbrace{\mathbb{E}_{x\sim p_{\text{data}}}[\log p_{\theta}(x)]}_{\text{Offline Anchor}}. \quad (10)$$

Algorithm 1 SHIFT: Smooth Hybrid Fine-tuning
Input: pre-trained denoiser $\epsilon_{\text{base}}$, reward model $r_{\omega}$, dataset $p_{\text{data}}$, iterations $N$, rollout size $B$, inner steps $K$, temperature $\beta$.
Initialize: $\theta\leftarrow\theta_{\text{base}}$.
1: for $n=1$ to $N$ do
2:   Sample real data $\{x_{0}^{b}\}_{b=1}
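The hybrid objective of Eq. (10) can be sketched as a per-batch training loss. This is a minimal sketch under two assumptions we make explicit: the log-likelihoods are stand-ins for the diffusion denoising-loss proxy the section's heading refers to, and the exponential advantage weights are treated as constants (no gradient through them); all names are illustrative:

```python
import numpy as np

def shift_loss(logp_online, adv, logp_offline, beta=1.0):
    """Hybrid SHIFT objective, Eq. (10), written as a loss to minimize.

    logp_online:  log p_theta(x~) proxies for on-policy rollouts x~ ~ p_theta
    adv:          relative advantages A(x~) of those rollouts
    logp_offline: log p_theta(x) proxies for real samples x ~ p_data
    """
    # Online exploration: rollouts weighted by exp(A(x~) / beta).
    w = np.exp(adv / beta)
    online = (w * logp_online).mean()
    # Offline anchor: plain log-likelihood on real data.
    anchor = logp_offline.mean()
    # Maximizing J(theta) = minimizing its negation.
    return -(online + anchor)
```

The anchor term pulls $p_\theta$ toward $p_{\text{data}}$ regardless of the reward, which is what keeps the fine-tuned model from drifting the way a reverse-KL reference-policy constraint allows.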