Pseudocode of TAP

1: Input: denoiser $f_\theta$, predictor set $\mathcal{P}$, window $N$, distance $d(\cdot,\cdot)$
2: $C_h \leftarrow \varnothing$, $C_r \leftarrow \varnothing$  ▷ compact cache: first-layer modulated input and residual
3: for $t \leftarrow T$ downto $1$ do
4:  $\mathbf{x}_t \leftarrow$ current model input
5:  $\mathbf{h}^t \leftarrow \mathrm{Modulate}(\mathrm{Norm}_1(\mathbf{x}_t), \mathbf{s}_t, \mathbf{g}_t)$  ▷ Eq. (5)
6:  if $t \bmod N = 0$ then
7:   $\mathbf{r}_t \leftarrow f_\theta(\mathbf{x}_t, t) - \mathbf{x}_t$  ▷ full residual, Eq. (6)
8:   $C_h \leftarrow \mathbf{h}^t$, $C_r \leftarrow \mathbf{r}_t$  ▷ store compact proxies
9:   use $f_\theta(\mathbf{x}_t, t)$ as the model output for this step
10:  else
11:   for all $p \in \mathcal{P}$ do  ▷ parallel prediction from cached proxies (e.g., Taylor variants)
12:    $\widehat{\mathbf{h}}_{t,p} \leftarrow \mathrm{Predict}(p, C_h)$  ▷ Eq. (4)
13:   end for
14:   for all tokens $(b, n)$ do
15:    $p^{*}_{b,n} \leftarrow \arg\min_{p \in \mathcal{P}} d\big(\widehat{\mathbf{h}}_{t,p}[b,n],\, \mathbf{h}^t[b,n]\big)$  ▷ probe-driven per-token selection
16:    $\widehat{\mathbf{r}}_t[b,n] \leftarrow \mathrm{Predict}(p^{*}_{b,n}, C_r)[b,n]$  ▷ residual from the selected predictor
17:   end for
18:   use $\mathbf{x}_t + \widehat{\mathbf{r}}_t$ as the model output for this step
19:  end if
20: end for
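To make the cache-reuse branch concrete, the following PyTorch-style sketch mirrors steps 11–18 of the pseudocode: it probes the first layer, lets every candidate predictor forecast the probe from the cached proxy, selects a predictor per token, and assembles the residual. The function and argument names (`tap_cached_step`, `probe_fn`, `predict`), the per-token mean-absolute-error distance, and the tensor layout are illustrative assumptions, not the authors' released implementation.

```python
import torch

def tap_cached_step(x_t, t, probe_fn, predictors, cache_h, cache_r, distance=None):
    """One accelerated (cache-reuse) step of TAP, as a hedged sketch.

    x_t:         current model input, token tensor of shape [B, N, C].
    probe_fn:    computes the first-layer modulated input (the low-cost probe, Eq. (5)).
    predictors:  list of objects exposing .predict(cache, t) -> [B, N, C]
                 (e.g., Taylor-style extrapolators over the cached proxy history).
    cache_h/r:   cached first-layer and residual proxies from the last full step(s).
    """
    if distance is None:
        def distance(a, b):                       # assumed d(.,.): per-token MAE over channels
            return (a - b).abs().mean(dim=-1)     # [B, N]

    # Low-cost probe: actual first-layer modulated input at this step.
    h_t = probe_fn(x_t, t)                                               # [B, N, C]

    # Every candidate predictor forecasts the probe from the cached proxy.
    h_preds = torch.stack([p.predict(cache_h, t) for p in predictors])   # [P, B, N, C]

    # Per-token error of each candidate against the actual probe, then argmin.
    errs = torch.stack([distance(h_hat, h_t) for h_hat in h_preds])      # [P, B, N]
    best = errs.argmin(dim=0)                                            # [B, N]

    # The same predictors forecast the residual from the cached residual proxy.
    r_preds = torch.stack([p.predict(cache_r, t) for p in predictors])   # [P, B, N, C]

    # Gather, for every token, the residual predicted by its selected predictor.
    idx = best.unsqueeze(0).unsqueeze(-1).expand(1, *r_preds.shape[1:])  # [1, B, N, C]
    r_hat = torch.gather(r_preds, 0, idx).squeeze(0)                     # [B, N, C]

    # Skip the full denoiser call and reuse the per-token predicted residual.
    return x_t + r_hat
```

Because the probe only requires the first normalization and modulation of the block, the selection overhead is small compared with a full forward pass, and the candidate predictions can be evaluated in parallel.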
Among the ablated configurations, $O_r = 2$ with $\lambda = 4$, together with jointly varying both distance and order, produced the largest improvement (from 0.95 to 0.99). Performance continues to improve as the predictor family grows, but with diminishing returns and eventual saturation. Increasing granularity (e.g., using $\delta = 0.1$ instead of $\delta = 1$) yields only a small additional gain (about 0.005 ImageReward), so we use $\delta = 1$ by default for simplicity. Two additional observations emerge. First, including zeroth-order predictors (order 0) is particularly valuable: they are more robust to abrupt, non-continuous token dynamics and therefore complement higher-order predictors, yielding larger gains than using only high-order variants. Second, shifting the prediction window to the left (earlier expansion points) gives notable improvements because it avoids extrapolating beyond a token's Taylor convergence radius; moving the window to the right (e.g., $[k, k+2]$) yields little benefit. These findings validate the assumptions and effectiveness of our method.

Figure 3: Visualization results. On FLUX.1-dev, TAP delivers higher speedup without quality loss.

Figure 4: Visualization of video generation. "A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background".

Comparison with Global Predictors. We constructed a pool of 30 candidate predictors by systematically varying the Taylor expansion order and prediction horizon, and evaluated each predictor's generative performance in isolation. The ImageReward of individual global predictors varied roughly between 0.86 and 0.92, while PSNR ranged approximately from 14.51 to 15.32; no single global predictor was best across all acceleration regimes. Instead, TAP's probe-driven, per-token selection adaptively fuses these diverse predictors and consistently outperforms any individual predictor, showing that the improvements arise from intelligently combining complementary predictors rather than from a single optimal sampler.
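To illustrate how such a pool can be enumerated, the sketch below builds truncated-Taylor extrapolators of orders 0–2 over a grid of prediction horizons, using finite differences of cached snapshots. The `TaylorPredictor` class, its `predict` interface, and the specific orders, horizons, and coefficients are assumptions made for exposition; the paper's actual predictor family and parameterization may differ.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class TaylorPredictor:
    """Extrapolates a cached token tensor forward with a truncated Taylor series,
    using finite differences of the most recent cached snapshots.

    order:   0 replays the newest snapshot, 1 adds a first difference, 2 a second.
    horizon: how far past the newest cached snapshot to extrapolate.
    Hypothetical interface, not the paper's exact predictor definition.
    """
    order: int
    horizon: float

    def predict(self, cache, t=None):
        # `cache` is a list of token tensors [B, N, C] at past anchor steps, oldest first.
        y = cache[-1]                                    # zeroth-order term
        if self.order >= 1 and len(cache) >= 2:
            d1 = cache[-1] - cache[-2]                   # first finite difference
            y = y + self.horizon * d1
        if self.order >= 2 and len(cache) >= 3:
            d2 = cache[-1] - 2 * cache[-2] + cache[-3]   # second finite difference
            y = y + 0.5 * (self.horizon ** 2) * d2
        return y

def build_predictor_pool(orders=(0, 1, 2), horizons=(0.5, 1.0, 1.5, 2.0, 3.0)):
    # Enumerating orders x horizons gives a small family of complementary global
    # predictors; TAP then chooses among them per token at run time.
    return [TaylorPredictor(o, h) for o, h in product(orders, horizons)]

# Example: a 15-predictor pool; a larger grid would give the 30 candidates above.
pool = build_predictor_pool()
```

Each element of such a pool behaves like one "global" predictor when applied uniformly to all tokens; the gains reported above come from letting the probe pick among them token by token.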
4.4 Qualitative Analysis

We present representative visualizations in Figure 3 and Figure 4. Across both image and video examples, cache-only and global-forecast baselines exhibit noticeable degradation at high acceleration ratios, manifesting as blurred details, distorted object geometry, and misalignment with text conditions. In contrast, TAP effectively preserves fine-grained textures, structural integrity, and visual consistency. This advantage comes from TAP's token-adaptive assignment, which selects the best predictor for each token at every timestep and thus preserves perceptual fidelity even at high acceleration ratios. These qualitative observations match the quantitative gains reported above.

5 Conclusion

We present TAP, a probe-driven, token-adaptive diffusion-acceleration framework that makes per-token predictions based on a lightweight proxy. TAP is highly efficient, fully parallelizable, and compatible with a wide range of predictor designs. It achieves substantial speedups in diffusion sampling with minimal memory and compute overhead while preserving perceptual quality. Experiments demonstrate consistent gains across both image and video models.