The paper investigates outlier tokens in Diffusion Transformers (DiTs) for image generation and finds them in both the ViT encoder and the DiT denoiser of Representation Autoencoder (RAE)-DiT pipelines. Simply masking these high-norm tokens does not improve performance, which suggests the issue stems from corrupted local patch semantics rather than from extreme values alone. To mitigate this, the authors introduce Dual-Stage Registers (DSR), a register-based intervention applied to both components, which reduces outlier artifacts and improves generation quality on ImageNet and large-scale text-to-image generation.
Outlier tokens in Diffusion Transformers aren't just extreme values; they corrupt local patch semantics, and can be tamed with Dual-Stage Registers to boost image generation quality.
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
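To make "high-norm tokens" concrete, the following is a minimal sketch of how such tokens might be flagged in one layer's token matrix. It is not the authors' procedure: the function name `find_outlier_tokens`, the median/MAD scoring rule, the threshold `z_thresh`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def find_outlier_tokens(tokens, z_thresh=6.0):
    """Flag tokens whose L2 norm is far above the typical norm.

    tokens: array of shape (num_tokens, dim), e.g. one layer's patch tokens.
    Returns (indices of flagged tokens, per-token norms).
    The robust z-score rule is an illustrative assumption, not the paper's criterion.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    median = np.median(norms)
    # Median absolute deviation, rescaled to match a standard deviation
    # under a normal distribution; epsilon avoids division by zero.
    mad = np.median(np.abs(norms - median))
    scale = 1.4826 * mad + 1e-8
    robust_z = (norms - median) / scale
    return np.flatnonzero(robust_z > z_thresh), norms


# Synthetic stand-in for real activations: 256 tokens of width 64,
# with two tokens inflated to mimic high-norm outliers.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))
tokens[[5, 100]] *= 10.0

outlier_idx, norms = find_outlier_tokens(tokens)
print("flagged token indices:", outlier_idx.tolist())
```

A median-based score is used here so that the outliers themselves do not inflate the scale estimate. Note that the abstract reports that masking tokens selected by norm does not by itself improve performance, so a detector like this is useful for diagnosis rather than as a fix.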