2020-07-30, Red Track
We present a self-contained approach for developing massively scalable multi-GPU solvers for coupled nonlinear systems of partial differential equations (PDEs) in Julia. The approach encompasses numerics, implementation and performance evaluation. We showcase several 2-D and 3-D multi-GPU PDE solvers, for example a solver for spontaneous nonlinear porous flow localization in 3-D that scales nearly ideally on thousands of GPUs.
The widely applicable approach we present relies on a powerful stencil-based iterative method that converges efficiently to the time-dependent implicit solution of strongly nonlinear problems. The method is well suited to both shared and distributed memory parallelism.
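To make the idea of iterating a stencil toward an implicit solution concrete, here is a minimal serial sketch in plain Julia. It is not the method from the talk: the model problem (1-D nonlinear diffusion with diffusivity u^3), the helper name `implicit_step!`, the conservative pseudo-time step and all parameter values are assumptions chosen only for illustration.

```julia
# Hypothetical helper (not from the talk): iterate in pseudo-time until the
# residual of one backward-Euler step of du/dt = d/dx(u^3 * du/dx) is small.
# Boundary values of u are held fixed.
function implicit_step!(u, u_old, dx, dt; tol=1e-6, maxiter=100_000)
    n   = length(u)
    q   = zeros(n - 1)   # fluxes, staggered between grid points
    res = zeros(n - 2)   # residual at inner grid points
    err = Inf
    for iter in 1:maxiter
        for i in 1:n-1   # flux stencil: q = -D(u) * du/dx with D(u) = u^3
            D    = (0.5 * (u[i] + u[i+1]))^3
            q[i] = -D * (u[i+1] - u[i]) / dx
        end
        for i in 2:n-1   # residual stencil: -(u - u_old)/dt - dq/dx
            res[i-1] = -(u[i] - u_old[i]) / dt - (q[i] - q[i-1]) / dx
        end
        err = maximum(abs, res)
        if err <= tol
            return (iter, err)
        end
        # Pseudo-time step, chosen conservatively for stability (assumption).
        dtau = 1.0 / (1.0 / dt + 4.0 * maximum(abs, u)^3 / dx^2)
        @views u[2:end-1] .+= dtau .* res
    end
    return (maxiter, err)
end

n     = 64
dx    = 1.0 / (n - 1)
x     = range(0.0, 1.0; length=n)
u_old = [0.1 + exp(-((xi - 0.5) / 0.1)^2) for xi in x]  # previous time step
u     = copy(u_old)                                     # initial guess
iters, err = implicit_step!(u, u_old, dx, 1e-3)
println("stopped after $iters iterations, residual = $err")
```

Each iteration only evaluates local stencils, which is why this kind of scheme maps naturally onto both threads and distributed subdomains.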
The implementation approach enables the straightforward development of a single Julia code that can be readily deployed on a single CPU thread or on thousands of GPUs/CPUs. We have instantiated the approach in the Julia packages ParallelStencil and ImplicitGlobalGrid. ParallelStencil empowers domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. ParallelStencil uses CUDAnative for computations on GPUs and Base.Threads for computations on CPUs. ImplicitGlobalGrid renders the distributed parallelization of stencil-based GPU and CPU applications on a regular (staggered) grid nearly trivial. ImplicitGlobalGrid relies on the Julia MPI wrapper, MPI.jl, to perform halo updates close to the hardware limit and leverages CUDA-aware MPI for GPU applications. We have designed both ParallelStencil and ImplicitGlobalGrid for the simplest possible usage by domain scientists, making fast and interactive development of massively scalable high-performance multi-GPU applications readily accessible to them.
We conduct the performance evaluation with a simple metric for iterative PDE solvers. The metric measures effective memory throughput and is complementary to traditional metrics.
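As a rough sketch of how such a metric can be computed, the snippet below assumes one common definition of effective memory throughput: per iteration, every unknown field must be read and written once and every known field read once, and the resulting byte count is divided by the time per iteration. The abstract does not spell out the exact definition, so the formula, the function name `effective_throughput` and its parameters are assumptions.

```julia
# Assumed definition: A_eff = (2 * n_unknown + n_known) fields of nx*ny*nz
# numbers each, in GB; T_eff = A_eff / t_it in GB/s.
function effective_throughput(nx, ny, nz, n_unknown, n_known, t_it;
                              bytes_per_number=sizeof(Float64))
    A_eff = (2 * n_unknown + n_known) * nx * ny * nz * bytes_per_number / 1e9
    return A_eff / t_it
end

# Hypothetical numbers: 512^3 grid, one unknown and one known field,
# 0.01 s per iteration.
T_eff = effective_throughput(512, 512, 512, 1, 1, 0.01)
println("T_eff = ", round(T_eff; digits=1), " GB/s")
```

With these made-up inputs the result is about 322 GB/s. Comparing such a value with the peak memory bandwidth of the hardware indicates how far a memory-bound solver is from its practical limit.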
We demonstrate the broad applicability of our approach by showcasing multiple 2-D and 3-D multi-GPU PDE solvers, for instance a solver for spontaneous nonlinear multi-physics porous flow localization in 3-D. As a reference, we ported the latter solver from MPI+CUDA C to Julia; it achieves 95% of the performance of the original solver and a nearly ideal parallel efficiency on thousands of NVIDIA Tesla P100 GPUs on the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS). We evaluate the performance and scalability of the presented solvers on Piz Daint. The majority of the presented solvers are being made publicly available as part of the documentation of the packages ParallelStencil and ImplicitGlobalGrid.
Co-authors: Ludovic Räss¹ ², Grzegorz Kwasniewski¹, Benjamin Malvoisin³, Yury Podladchikov³
¹ ETH Zurich | ² Stanford University | ³ University of Lausanne
Computational Scientist and responsible for Julia Computing at the Swiss National Supercomputing Centre (CSCS), ETH Zurich