This paper introduces Self-Consistent Distribution Matching Distillation (SC-DMD) to improve the quality of distilled video generation models at very low inference budgets (2-4 NFEs). SC-DMD regularizes the consistency of denoising updates across timesteps to prevent drift, addressing limitations of both trajectory-style and standard distribution matching distillation. For autoregressive models, the authors further propose cache-distribution-aware training, which aligns features according to KV-cache quality to improve real-time video generation.
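The self-consistency idea can be illustrated with a toy numerical sketch. The linear update rule and the names `denoise_step` and `eps_pred` below are illustrative assumptions, not the paper's actual formulation:

```python
def denoise_step(x, t_from, t_to, eps_pred):
    """One toy Euler-style denoising update: move x along the predicted direction.

    `eps_pred(x, t)` stands in for the student's noise/velocity prediction;
    both the update rule and the linear schedule are illustrative assumptions.
    """
    return x + (t_from - t_to) * eps_pred(x, t_from)


def self_consistency_loss(x, t0, t1, t2, eps_pred):
    """Penalize disagreement between composed and direct denoising jumps.

    The regularizer sketched here follows the abstract's description:
    composing t0 -> t1 -> t2 should land where the direct jump t0 -> t2 lands.
    """
    composed = denoise_step(denoise_step(x, t0, t1, eps_pred), t1, t2, eps_pred)
    direct = denoise_step(x, t0, t2, eps_pred)
    return (composed - direct) ** 2
```

With a predictor that ignores its input, the linear updates compose exactly and the penalty is zero; a state-dependent predictor breaks composition, and a training objective of this shape would push that gap down.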
Real-time video generation gets a boost: Salt achieves sharper, more dynamic videos at extremely low inference budgets by explicitly enforcing consistency across denoising steps.
Distilling video generation models to extremely low inference budgets (e.g., 2--4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome these limitations, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan~2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed \textbf{Salt}, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at \href{https://github.com/XingtongGe/Salt}{https://github.com/XingtongGe/Salt}.
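The cache-conditioned feature alignment objective can be sketched, under assumptions, as steering features computed under a degraded KV cache toward features from a high-quality reference cache. The function name and the MSE choice are illustrative, not taken from the paper:

```python
def cache_feature_alignment_loss(feats_low, feats_high):
    """Cache-conditioned feature alignment, sketched as a plain MSE.

    `feats_low` are features produced while conditioning on a degraded
    (e.g., heavily compressed) KV cache; `feats_high` come from a
    high-quality reference cache and would be treated as a fixed target
    (stop-gradient) in a real training loop. Both names and the MSE
    choice are assumptions for illustration.
    """
    assert len(feats_low) == len(feats_high)
    n = len(feats_low)
    return sum((a - b) ** 2 for a, b in zip(feats_low, feats_high)) / n
```

Minimizing this term pulls the low-quality branch toward the reference, which matches the abstract's stated goal of keeping generation quality stable across diverse KV-cache memory mechanisms.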