JuliaCon 2022 (Times are UTC)

High-performance xPU Stencil Computations in Julia
07-27, 15:20–15:30 (UTC), Purple

We present an efficient approach for writing architecture-agnostic, parallel, high-performance stencil computations in Julia. Powerful metaprogramming, zero-cost abstractions and multiple dispatch make it possible to write a single code that serves both for productive prototyping on a single CPU and for production runs on GPU or CPU workstations or supercomputers. Performance similar to CUDA C is achievable, which is typically a large improvement over the performance reachable with CUDA.jl array programming.

Our approach to expressing architecture-agnostic high-performance stencil computations relies on Julia's powerful metaprogramming capabilities, zero-cost high-level abstractions and multiple dispatch. We have instantiated the approach in the Julia package ParallelStencil.jl. With ParallelStencil, a simple call to the macro @parallel is enough to parallelize and launch a kernel that contains stencil computations, which can be expressed explicitly or with math-close notation. The package used underneath for parallelization is selected in an initialization call beforehand. Currently supported are CUDA.jl for running on GPU and Base.Threads for running on CPU. Leveraging metaprogramming, ParallelStencil automatically generates high-performance code suitable for the target hardware and automatically derives kernel launch parameters from the kernel arguments by analyzing the extents of the arrays they contain. A set of architecture-agnostic low-level kernel language constructs allows for explicit low-level kernel programming when useful, e.g., for the explicit control of shared memory on the GPU (these low-level constructs are biased towards GPU computing).
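As a rough illustration of the workflow described above, the following sketch follows the 3-D heat diffusion example from the ParallelStencil.jl documentation, reduced to a single time step on a tiny grid. The Threads backend is selected so that it runs on a CPU. The grid size, the physical parameters and the point-source initial condition are assumptions chosen only for illustration and are not taken from the talk.

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences3D   # provides @inn, @d2_xi, @d2_yi, @d2_zi

# Select the backend, number type and dimensionality once, up front.
# Selecting CUDA instead of Threads would target an NVIDIA GPU.
@init_parallel_stencil(Threads, Float64, 3)

# One explicit time step of heat diffusion, written in math-close notation.
# @inn selects the inner points of an array; @d2_xi is a second-order
# difference in x taken at the inner points of the other dimensions.
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2))
    return
end

# Illustrative values (assumptions).
nx, ny, nz = 5, 5, 5
lam, dt    = 1.0, 0.1
dx = dy = dz = 1.0

# The allocation macros place the arrays on the selected hardware.
T  = @zeros(nx, ny, nz)
T2 = @zeros(nx, ny, nz)
Ci = @ones(nx, ny, nz)
T[3, 3, 3] = 1.0   # point source in the centre (scalar indexing: CPU only)

# Launch the kernel; launch parameters are derived from the array arguments.
@parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
```

Only the inner points of T2 are updated by this kernel; boundary values keep whatever they were initialized to, so boundary conditions would need a separate kernel or update.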

Arrays are automatically allocated on the hardware chosen for the computations (GPU or CPU) when using the allocation macros provided by ParallelStencil, avoiding any need for code duplication. Moreover, the allocation macros are fully declarative, which lets ParallelStencil choose the best data layout in memory. Notably, logical arrays of structs (or of small arrays) can be laid out in memory either as arrays of structs or as structs of arrays, because each of these layouts has use cases in which it performs best.
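The difference between the two layouts can be sketched in plain Julia. The example below does not use ParallelStencil's allocation macros; the type Tensor2D and all values are hypothetical and serve only to show how the same logical array of cells maps to memory in each case.

```julia
# Hypothetical cell type: a symmetric 2-D tensor with three components.
struct Tensor2D
    xx::Float64
    yy::Float64
    xy::Float64
end

n = 4

# Array of structs: the three components of one cell are adjacent in memory.
aos = [Tensor2D(i, 10i, 100i) for i in 1:n]

# Struct of arrays: each component is stored in its own contiguous array.
soa = (xx = [1.0i for i in 1:n],
       yy = [10.0i for i in 1:n],
       xy = [100.0i for i in 1:n])

# The same logical access, "component xy of cell 3", in both layouts.
xy_aos = aos[3].xy
xy_soa = soa.xy[3]

# Raw memory order of the array of structs: the components are interleaved
# (xx, yy, xy of cell 1, then xx, yy, xy of cell 2, and so on).
flat_aos = reinterpret(Float64, aos)
```

A kernel that reads all components of one cell at a time tends to favour the first layout, whereas a kernel that sweeps over one component of many cells reads contiguous memory in the second.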

ParallelStencil is seamlessly interoperable with packages for distributed parallelization, such as ImplicitGlobalGrid.jl or MPI.jl, in order to enable high-performance stencil computations on GPU or CPU supercomputers. Communication can be hidden behind computation with a simple macro call. Using this feature only requires that communication can be triggered explicitly, as is possible with, e.g., ImplicitGlobalGrid and MPI.jl.
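The idea of hiding communication behind computation can be sketched conceptually in plain Julia. This is not how ParallelStencil implements its macro: the function names are hypothetical, the "communication" is a stand-in that copies periodic halo values within a single 1-D array, and the coefficient is an arbitrary assumption.

```julia
# Diffusion update for the given cells (T2 is written, T is only read).
function update!(T2, T, cells, c)
    for i in cells
        T2[i] = T[i] + c * (T[i-1] - 2T[i] + T[i+1])
    end
    return
end

# Stand-in for a halo exchange (hypothetical): on a single periodic domain,
# each halo cell receives the value of the opposite boundary cell.
function exchange_halo!(T2)
    T2[1]   = T2[end-1]
    T2[end] = T2[2]
    return
end

function step_overlapped!(T2, T, c)
    n = length(T)
    update!(T2, T, (2, n-1), c)               # 1. boundary cells first
    comm = Threads.@spawn exchange_halo!(T2)  # 2. start the "communication"
    update!(T2, T, 3:n-2, c)                  # 3. update inner cells meanwhile
    wait(comm)                                # 4. ensure the halo is complete
    return
end

# Reference version without overlap: compute everything, then communicate.
function step_sequential!(T2, T, c)
    update!(T2, T, 2:length(T)-1, c)
    exchange_halo!(T2)
    return
end

T = collect(range(0.0, 1.0; length=8)) .^ 2
A = copy(T)
B = copy(T)
step_overlapped!(A, T, 0.1)
step_sequential!(B, T, 0.1)
```

The overlap is safe here because the exchange only touches the halo cells and the already updated boundary cells, while the concurrent update writes only inner cells; both versions therefore produce the same result.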

We demonstrate the wide applicability of our approach by reporting on several multi-GPU solvers for geosciences, such as 3-D solvers for poro-visco-elastic two-phase flow and for reactive porosity waves. As a reference, the latter solvers were ported from MPI+CUDA C to Julia using ParallelStencil and ImplicitGlobalGrid; they achieve 90% and 98% of the performance of the original solvers, respectively, and a nearly ideal parallel efficiency on thousands of NVIDIA Tesla P100 GPUs at the Swiss National Supercomputing Centre. Moreover, we have shown in recent contributions that the approach is in no way limited to geosciences: we have showcased a computational cognitive neuroscience application modelling visual target selection using ParallelStencil and MPI.jl, and a quantum fluid dynamics solver using the nonlinear Gross-Pitaevskii equation implemented with ParallelStencil and ImplicitGlobalGrid.

Co-authors: Ludovic Räss¹ ²

¹ ETH Zurich | ² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)

Geoscientist with strong interests in Julia, HPC, GPUs, and supercomputing. Applications to resolve multi-physics processes in ice dynamics and geodynamics across scales.
