This paper explores the use of FP64 tensor cores on NVIDIA GPUs to accelerate high-order finite element simulations, a critical component in many scientific and engineering applications. By integrating FP64 tensor cores with kernel fusion optimizations within the MFEM library, the authors achieved up to 2x performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. The optimized kernels demonstrated near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system, demonstrating performance at exascale.
FP64 tensor cores, which to the authors' knowledge had not previously been programmed directly for large-scale finite element computing, deliver up to 2x speedups and up to 83% energy efficiency gains in finite element simulations on NVIDIA's GH200 and GB200 architectures.
Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
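For readers unfamiliar with the scaling metrics quoted above, the following minimal sketch shows how weak and strong scaling efficiency are conventionally computed. The function names and all timings are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from the paper, and the authors' exact measurement methodology is not stated in the abstract.

```python
def strong_scaling_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Fixed total problem size: ideally, time shrinks in proportion to GPU count."""
    speedup = t_base / t_scaled          # measured speedup
    ideal_speedup = n_scaled / n_base    # speedup under perfect scaling
    return speedup / ideal_speedup


def weak_scaling_efficiency(t_base, t_scaled):
    """Fixed work per GPU: ideally, time stays constant as GPUs are added."""
    return t_base / t_scaled


# Hypothetical timings in seconds (illustrative only, not from the paper).
strong = strong_scaling_efficiency(t_base=80.0, n_base=1024,
                                   t_scaled=11.0, n_scaled=8192)
weak = weak_scaling_efficiency(t_base=10.0, t_scaled=10.2)

print(f"strong scaling efficiency: {strong:.1%}")
print(f"weak scaling efficiency:   {weak:.1%}")
```

Under these definitions, a strong scaling efficiency of 90% means the run on the larger GPU count achieved 90% of the speedup that perfect scaling would have given for the same total problem size.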