PARE is a pruning and adaptive routing strategy that improves the efficiency of Video Diffusion Transformers (DiTs) by compressing both width and depth. It uses structure-aware pruning that accounts for the spatial or temporal specialization of attention heads, and trains a lightweight router, conditioned on denoising timestep and visual content, to dynamically select which blocks to execute. Experiments on Wan2.1-14B show that PARE substantially reduces per-step computation while maintaining video generation quality.
By pruning attention heads according to their spatial or temporal roles and adaptively selecting which blocks to execute at each denoising step, PARE substantially reduces per-step computation in video generation while preserving quality.
Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.
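The two mechanisms described in the abstract can be illustrated with a small toy sketch. It is not the paper's method: the abstract gives neither the importance-scoring formula nor the router architecture, so everything here is an assumption made for illustration. The function names `select_heads_to_keep` and `route_blocks`, the per-role keep ratio, the linear sigmoid gate, and all numbers are hypothetical. The first function shows one simple way to stop temporal heads from being pruned wholesale (ranking heads within each role rather than globally); the second shows a gate that decides per block, from a normalized timestep and a scalar content feature, whether to execute or skip.

```python
import math


def select_heads_to_keep(scores, roles, keep_ratio):
    """Keep the top-scoring heads within each role group separately.

    Assumption: ranking per role (instead of globally) stands in for the
    paper's structure-aware scoring, which the abstract does not specify.
    """
    keep = []
    for role in sorted(set(roles)):
        idx = [i for i, r in enumerate(roles) if r == role]
        n_keep = max(1, math.ceil(keep_ratio * len(idx)))
        ranked = sorted(idx, key=lambda i: scores[i], reverse=True)
        keep.extend(ranked[:n_keep])
    return sorted(keep)


def route_blocks(timestep, content_feature, weights, biases, threshold=0.5):
    """Return one execute/skip decision per block.

    Assumption: a linear gate with a sigmoid, one (w_t, w_c) pair per block.
    `timestep` is normalized to [0, 1]; `content_feature` is a scalar
    summary of the visual content (hypothetical).
    """
    decisions = []
    for (w_t, w_c), b in zip(weights, biases):
        logit = w_t * timestep + w_c * content_feature + b
        prob = 1.0 / (1.0 + math.exp(-logit))
        decisions.append(prob >= threshold)
    return decisions


# Toy data: four spatial heads with high scores, four temporal with low scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]
roles = ["spatial"] * 4 + ["temporal"] * 4

# A global top-4 would keep heads 0-3 and drop every temporal head.
global_top4 = sorted(sorted(range(8), key=lambda i: scores[i], reverse=True)[:4])
role_aware = select_heads_to_keep(scores, roles, keep_ratio=0.5)

# Toy router parameters for three blocks (hypothetical values).
weights = [(4.0, 0.0), (-4.0, 0.0), (0.0, 4.0)]
biases = [-2.0, 2.0, -2.0]
early_step = route_blocks(1.0, 0.0, weights, biases)
late_step = route_blocks(0.0, 1.0, weights, biases)
```

With these toy values, `global_top4` is `[0, 1, 2, 3]` while `role_aware` is `[0, 1, 4, 5]`, so two temporal heads survive. The router executes only the first block for the `(1.0, 0.0)` input and only the last two blocks for the `(0.0, 1.0)` input, which is the per-input, per-step compute adaptation the abstract contrasts with static block removal.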