Search papers, labs, and topics across Lattice.
This paper introduces a fully convolutional diffusion model (FCDM) based on the ConvNeXt architecture, tailored for conditional diffusion modeling. The authors demonstrate that FCDM-XL achieves competitive performance compared to DiT-XL/2 while using only 50% of the FLOPs and requiring significantly fewer training steps at 256x256 and 512x512 resolutions. The model's training efficiency is highlighted by its ability to be trained on a 4-GPU system.
ConvNets strike back: a ConvNeXt-based diffusion model matches Transformer performance at half the FLOPs and 7x faster training, all on just 4 GPUs.
Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7$\times$ and 7.5$\times$ fewer training steps at 256$\times$256 and 512$\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.