DAMO · HKU · NJU · ZJU · May 20, 2026 · arXiv:2605.20708

Rethinking Cross-Layer Information Routing in Diffusion Transformers

AI Summary

This paper analyzes cross-layer information flow in Diffusion Transformers (DiTs) and identifies three issues caused by standard residual connections: forward magnitude inflation, backward gradient decay, and block-wise redundancy. To address these, the authors propose Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive aggregation method for sublayer outputs. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations, pointing to cross-layer routing as an underexplored design axis for DiTs.

Key Contribution

DiTs are leaving performance on the table by using vanilla residual connections, and a simple timestep-adaptive routing mechanism can unlock significant gains in both training efficiency and final image quality.

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
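The abstract does not give DAR's exact formulation, so the following is only a toy sketch of the general idea it describes: replacing the plain residual sum with a learnable, timestep-dependent weighting over the full history of sublayer outputs. The softmax parameterization and the names `adaptive_route`, `base_logits`, and `time_slopes` are assumptions made for illustration, not the authors' method, and the fixed numbers stand in for parameters that would be learned in practice.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_stream(history):
    """Standard residual addition: the plain sum of all earlier outputs."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]

def adaptive_route(history, base_logits, time_slopes, t):
    """Hypothetical timestep-adaptive routing (an assumption, not the paper's formula).

    Each history entry i gets a logit base_logits[i] + time_slopes[i] * t.
    The result is recomputed from the whole history as a convex combination,
    rather than by adding one new term to a running sum.
    """
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Toy history: an input embedding followed by three sublayer outputs.
history = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0, 0.0]    # stand-ins for learned parameters
time_slopes = [-2.0, -1.0, 1.0, 2.0]  # stand-ins for learned parameters

summed = residual_stream(history)
early, w_early = adaptive_route(history, base_logits, time_slopes, t=0.0)
late, w_late = adaptive_route(history, base_logits, time_slopes, t=1.0)

print("residual sum norm:", norm(summed))
print("routed norm (t=0):", norm(early), "weights:", w_early)
print("routed norm (t=1):", norm(late), "weights:", w_late)
```

In this toy setup the routed output is a convex combination, so its norm cannot exceed the largest norm in the history, whereas the plain sum grows as more outputs are added. The weights also shift toward later outputs as `t` increases, which is one simple way a routing rule could depend on the denoising timestep.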

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
