Mar 29, 2026 · arXiv:2603.27569

Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition

AI Summary

This paper presents an FFT implementation for Apple Silicon GPUs built on a two-tier memory model: the 208 KiB register file holds the data, and the 32 KiB threadgroup memory is used only for exchange. Using radix-8 Stockham kernels written in Metal Shading Language (MSL), the authors report 138.45 GFLOPS for N=4096 complex single-precision transforms, 29% above Apple's vDSP/Accelerate baseline (107 GFLOPS). They also report that threadgroup memory barriers are inexpensive on Apple GPUs (about 2 cycles), while scattered threadgroup access patterns are the real bottleneck.
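The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 5·N·log2(N) flop count is a common FFT benchmarking convention assumed here, not something the listing states the authors used.

```python
import math

# Figures quoted in the abstract.
BYTES_PER_COMPLEX64 = 8            # two 32-bit floats: real + imaginary
THREADGROUP_BYTES = 32 * 1024      # 32 KiB threadgroup memory
REGISTER_FILE_BYTES = 208 * 1024   # 208 KiB register file


def footprint_bytes(n):
    """Bytes needed to hold n complex single-precision samples."""
    return n * BYTES_PER_COMPLEX64


def fits_in_threadgroup(n):
    """True if one n-point transform fits in threadgroup memory."""
    return footprint_bytes(n) <= THREADGROUP_BYTES


def speedup_percent(new_gflops, base_gflops):
    """Relative improvement of new over base, in percent."""
    return (new_gflops / base_gflops - 1.0) * 100.0


def nominal_flops(n):
    """ASSUMPTION: the conventional 5*N*log2(N) flop count for a complex FFT."""
    return 5 * n * math.log2(n)


if __name__ == "__main__":
    for n in (256, 4096, 8192, 16384):
        print(n, footprint_bytes(n), fits_in_threadgroup(n))
    print(round(speedup_percent(138.45, 107.0), 1), "% over the vDSP baseline")
```

Under these numbers, N=4096 occupies exactly 32 KiB, which matches the abstract's statement that larger sizes need a four-step decomposition.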

Key Contribution

A radix-8 Stockham FFT for Apple Silicon GPUs that runs 29% faster than Apple's vDSP FFT at N=4096, built on a two-tier memory model that keeps data in the register file and uses threadgroup memory only for exchange.

Abstract

We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45~GFLOPS for $N\!=\!4096$ complex single-precision transforms -- a 29\% improvement over Apple's highly optimized vDSP/Accelerate baseline (107~GFLOPS). Our approach is grounded in a \emph{two-tier local memory model} that formally characterizes the Apple GPU's 208~KiB register file as the primary data-resident tier and the 32~KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's \texttt{simdgroup\_matrix} 8$\times$8 hardware MMA for FFT butterfly computation and report the counter-intuitive finding that on Apple GPU, threadgroup memory barriers are inexpensive ($\sim$2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports $N\!=\!256$ through $N\!=\!16384$ using a four-step decomposition for sizes exceeding the 32~KiB threadgroup memory limit. All kernels are validated against vDSP reference outputs.
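For readers unfamiliar with the Stockham formulation, the following is a minimal CPU sketch of a radix-2 Stockham autosort FFT in plain Python. It is not the authors' radix-8 MSL kernel and uses a decimation-in-frequency butterfly for brevity; the function names are hypothetical. It only illustrates the property that makes Stockham attractive on GPUs: each pass reads from one buffer and writes to another in an order that leaves the final output naturally sorted, with no bit-reversal step.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward transform, natural-order output)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    half = n // 2   # half the length of the current sub-transform
    stride = 1      # number of interleaved sub-transforms
    while half >= 1:
        for j in range(half):
            # Twiddle factor for a sub-transform of length 2 * half.
            w = cmath.exp(-2j * cmath.pi * j / (2 * half))
            for k in range(stride):
                a = src[k + stride * j]
                b = src[k + stride * (j + half)]
                dst[k + stride * (2 * j)] = a + b
                dst[k + stride * (2 * j + 1)] = (a - b) * w
        src, dst = dst, src   # ping-pong the two buffers between passes
        half //= 2
        stride *= 2
    return src


def naive_dft(x):
    """O(N^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


if __name__ == "__main__":
    data = [complex(i % 3, -i) for i in range(16)]
    err = max(abs(a - b) for a, b in zip(stockham_fft(data), naive_dft(data)))
    print("max abs error vs naive DFT:", err)
```

A higher-radix kernel such as the radix-8 variant described in the abstract would process eight inputs per butterfly and so need fewer passes over the data; the ping-pong structure stays the same.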

Distributed Systems & Hardware, Inference & Quantization, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
