2023-07-28, 32-124
We present an efficient approach for developing 3-D partial differential equation solvers that approach the hardware limits of modern GPUs and scale to the world's largest supercomputers. The approach relies on the accelerated pseudo-transient method and on the automatic generation of computation kernels with optimized on-chip memory usage. We report performance and scaling results on LUMI and Piz Daint, an AMD-GPU and an NVIDIA-GPU supercomputer, respectively.
For the development of massively scalable, high-performance 3-D partial differential equation (PDE) solvers, we rely on a powerful matrix-free pseudo-transient iterative algorithm. The algorithm allows the update rules for the fields to be formulated in a form close to explicit time integration, which is optimally suited for both shared- and distributed-memory parallelization.
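To make the formulation concrete, here is a minimal 1-D sketch of an accelerated pseudo-transient iteration for steady diffusion; the parameter names and the specific damping choice are illustrative assumptions, not the exact scheme used in the applications discussed here:

```julia
# Accelerated pseudo-transient solver for steady diffusion d/dx(D dC/dx) = 0
# with fixed boundary values C(0)=0 and C(lx)=1 (illustrative sketch).
function steady_diffusion_1D(; nx=256, lx=1.0, D=1.0, ϵtol=1e-8, maxiter=50nx)
    dx = lx / nx
    C  = collect(range(0.0, 1.0; length=nx))  # initial guess; boundary values stay fixed
    qx = zeros(nx - 1)                        # pseudo-transient fluxes at cell interfaces
    re = 2π                                   # assumed iteration parameter for diffusion
    ρ  = (lx / (D * re))^2                    # pseudo-density: inertia added to the flux equation
    dτ = dx * sqrt(ρ / D) / 1.1               # pseudo-time step (CFL-like restriction)
    for iter in 1:maxiter
        # damped flux update, discretizing  ρ ∂q/∂τ + q = -D ∂C/∂x
        qx .-= (dτ / (ρ + dτ)) .* (qx .+ D .* diff(C) ./ dx)
        # solution update, close in form to an explicit time step:  ∂C/∂τ = -∂q/∂x
        C[2:end-1] .-= dτ .* diff(qx) ./ dx
        if iter % nx == 0  # periodically check the residual of the steady equation
            err = maximum(abs.(diff(D .* diff(C) ./ dx) ./ dx))
            err < ϵtol && return C, iter
        end
    end
    return C, maxiter
end
```

Both update rules are local and stencil-like, which is what makes the method map so naturally onto GPUs and onto distributed grids.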
With respect to implementation, our approach is instantiated in the packages ParallelStencil.jl and ImplicitGlobalGrid.jl. It is fully architecture-agnostic and, for GPUs, includes the automatic generation of computation kernels with optimized on-chip memory usage as well as the automatic determination of their launch parameters. The on-chip memory usage optimization is achieved by explicitly reading large arrays through the memory shared by the threads of a block - when beneficial and when hardware resources allow for it - and by steering register usage in order to keep data that is to be reused on the chip whenever possible. The optimizations are applicable to complex real-world applications: a simplified optimization strategy requiring less on-chip memory can be used for a subset of the kernel input arrays when the most aggressive strategy would exhaust hardware resources. Kernel launch parameters such as blocks, threads and required shared memory can be defined automatically, in agreement with the applied optimizations. Furthermore, communication can be automatically hidden behind computation, including when the on-chip memory usage optimization is activated.
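The following condensed sketch, modeled after the public ParallelStencil.jl and ImplicitGlobalGrid.jl documentation examples (3-D heat diffusion), illustrates these features; array names and numbers are placeholders, and the `memopt=true` keyword requesting the on-chip memory optimization is an assumption about the package version in use:

```julia
const USE_GPU = true
using ParallelStencil, ImplicitGlobalGrid
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    using CUDA
    @init_parallel_stencil(CUDA, Float64, 3)   # architecture-agnostic: swap CUDA for AMDGPU or Threads
else
    @init_parallel_stencil(Threads, Float64, 3)
end

# 3-D diffusion step; memopt=true requests the on-chip memory usage
# optimization (keyword availability depends on the ParallelStencil version).
@parallel memopt=true function step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2)
    return
end

function diffusion3D(; nx=255, ny=255, nz=255, nt=100)
    me, dims = init_global_grid(nx, ny, nz)    # one MPI rank per GPU; global grid spans all ranks
    lam = dt = dx = dy = dz = 1.0              # placeholder physics and numerics
    T  = @rand(nx, ny, nz)                     # allocates on the selected architecture
    T2 = copy(T)
    Ci = @ones(nx, ny, nz)
    for it in 1:nt
        @hide_communication (16, 2, 2) begin   # overlap halo exchange with inner-point computation
            @parallel memopt=true step!(T2, T, Ci, lam, dt, dx, dy, dz)
            update_halo!(T2)
        end
        T, T2 = T2, T
    end
    finalize_global_grid()
    return
end
```

The same source code runs on a laptop CPU, a single GPU, or thousands of GPUs; only the backend passed to `@init_parallel_stencil` and the MPI launch configuration change.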
We compare the performance achieved by some representative computation kernels against the performance obtained with alternative ways of expressing equivalent computations in Julia, for example GPU array broadcasting or straightforward CUDA.jl kernel programming. Finally, we report the scaling of several earth-science applications on the European flagship supercomputer LUMI at CSC in Finland, hosting AMD GPUs, and on the Piz Daint supercomputer at the Swiss National Supercomputing Centre in Switzerland, hosting NVIDIA GPUs.
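For reference, a broadcast-based baseline for the diffusion step above might look as follows; this is an illustrative sketch, not the exact benchmark code:

```julia
using CUDA

# Same 3-D diffusion stencil written with array broadcasting over views;
# works on CPU Arrays and CuArrays alike. Names here are illustrative.
function step_broadcast!(T2, T, Ci, lam, dt, dx, dy, dz)
    @views T2[2:end-1, 2:end-1, 2:end-1] .= T[2:end-1, 2:end-1, 2:end-1] .+
        dt .* lam .* Ci[2:end-1, 2:end-1, 2:end-1] .* (
            (T[3:end, 2:end-1, 2:end-1] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[1:end-2, 2:end-1, 2:end-1]) ./ dx^2 .+
            (T[2:end-1, 3:end, 2:end-1] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[2:end-1, 1:end-2, 2:end-1]) ./ dy^2 .+
            (T[2:end-1, 2:end-1, 3:end] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[2:end-1, 2:end-1, 1:end-2]) ./ dz^2)
    return T2
end

# usage, e.g.: T = CUDA.rand(Float64, 255, 255, 255); T2 = copy(T);
#              Ci = CUDA.ones(Float64, 255, 255, 255)
#              step_broadcast!(T2, T, Ci, 1.0, 1e-4, 1.0, 1.0, 1.0)
```

Broadcasting expressions like this one reread the input arrays from global memory for each shifted view, which is precisely the overhead the on-chip memory optimization targets.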
Co-authors: Ludovic Räss¹ ², Ivan Utkin¹
¹ ETH Zurich | ² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)
GPU computing and geo-HPC researcher at ETH Zurich, Switzerland.
Computational Scientist responsible for Julia computing at the Swiss National Supercomputing Centre (CSCS), ETH Zurich.
Ivan Utkin's scientific record revolves around numerical modelling of natural processes in geosciences, with a strong emphasis on resolving coupled multi-physics interactions. Specifically, the record includes studies of fluid flows in porous rocks, including compaction-driven flow focussing, ground displacement due to the elastic response of rocks to fluid pressure, and the influence of chemical interactions between fluids and rocks on the flow dynamics and the chemical composition observed in rocks. A particular focus of Utkin's research is the development of numerical techniques to resolve processes at high resolution using massively parallel computing architectures such as graphics processing units (GPUs).