JuliaCon 2024

Sustainable Development of Stencil-based HPC Applications
07-12, 15:50–16:00 (Europe/Amsterdam), Else (1.3)

We present a successful approach for the sustainable development of stencil-based HPC applications in Julia. The approach includes automatic performance optimization of hardware-agnostic high-level kernels, data layout abstractions enabling memory layouts optimized per backend, and GPU-aware inter-process communication that can automatically be hidden behind computation. We demonstrate, on multiple examples, near-optimal performance and scaling on thousands of GPUs.


In the context of the rapid evolution of hardware over the last decades and the rapidly growing hardware diversity, the High Performance Computing (HPC) community has identified the three "P"s, (scalable) Performance, (performance) Portability, and Productivity, as key goals for sustainable HPC development. To bring these orthogonal goals under one roof, the challenge needs to be split into manageable tasks, where the demanding computer science tasks are largely solved in what we call "HPC building blocks". We present here a successful approach for the sustainable development of stencil-based HPC applications, where the HPC building blocks are implemented in the three Julia packages ParallelStencil.jl, ImplicitGlobalGrid.jl and CellArrays.jl.

In our approach, numerical algorithms are formulated in architecture-agnostic, math-close code, which hides computer science aspects such as parallelization and optimization as much as possible or sensible. Writing such code yields productivity comparable to that of a classical prototyping environment. The HPC building blocks then turn these codes into massively scalable, high-performance multi-GPU/CPU applications or frameworks. More concretely, the HPC building blocks 1) perform automatic performance optimization and parallelization of hardware-agnostic kernels written explicitly or in math-close notation (ParallelStencil.jl), 2) provide data layout abstractions enabling memory layouts optimized per backend (CellArrays.jl), and 3) enable GPU-aware distributed parallelization whose communication can automatically be hidden behind computation (ImplicitGlobalGrid.jl), as sketched below.
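To give a concrete flavor of these building blocks, the following condensed sketch combines a ParallelStencil.jl kernel in math-close notation with ImplicitGlobalGrid.jl halo updates hidden behind computation. It loosely follows the 3-D heat diffusion miniapp from the packages' documentation; the problem setup (grid size, time step bound, number of steps) is an illustrative assumption.

    using ParallelStencil, ImplicitGlobalGrid
    using ParallelStencil.FiniteDifferences3D
    @init_parallel_stencil(Threads, Float64, 3)  # swap Threads for CUDA or AMDGPU to target GPUs

    # Math-close, architecture-agnostic kernel: explicit diffusion step on the inner points of T.
    @parallel function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
        @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)*_dx^2 + @d2_yi(T)*_dy^2 + @d2_zi(T)*_dz^2))
        return
    end

    function diffusion3D(; nx=64, ny=64, nz=64, lam=1.0, nt=100)
        me, dims   = init_global_grid(nx, ny, nz)          # implicit global grid over all MPI processes
        dx, dy, dz = 1.0/(nx_g()-1), 1.0/(ny_g()-1), 1.0/(nz_g()-1)
        dt         = min(dx, dy, dz)^2/lam/8.1             # simple explicit stability bound (assumption)
        T, T2, Ci  = @rand(nx, ny, nz), @zeros(nx, ny, nz), @ones(nx, ny, nz)
        for it = 1:nt
            @hide_communication (8, 2, 2) begin            # overlap halo exchange with computation
                @parallel diffusion3D_step!(T2, T, Ci, lam, dt, 1/dx, 1/dy, 1/dz)
                update_halo!(T2)
            end
            T, T2 = T2, T
        end
        finalize_global_grid()
    end

CellArrays.jl, in turn, provides the data layout abstraction; a minimal usage example (assumed setup) allocates an array of small static matrices on the CPU:

    using CellArrays, StaticArrays

    celldims = (2, 2)
    Cell = SMatrix{celldims..., Float64, prod(celldims)}  # each cell is a 2x2 static matrix
    A = CPUCellArray{Cell}(undef, 3, 3)                   # 3x3 array of such cells
    A[2, 2] = Cell(1.0, 2.0, 3.0, 4.0)

Changing the backend passed to @init_parallel_stencil retargets the same kernel code to GPUs without further source changes.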

Other important aspects of our approach are 1) the extensibility of the math-close notation for writing computation kernels, 2) the integration with the broader ecosystem, enabling, e.g., Enzyme-powered automatic differentiation through high-level syntax, and 3) the compatibility of our architecture-specific code generation with Julia's package extensions feature. We briefly discuss these points.
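As an illustration of point 2, the sketch below differentiates a simple kernel through ParallelStencil's high-level syntax, with Enzyme performing the reverse-mode differentiation under the hood. It is modeled on the automatic differentiation example in the ParallelStencil.jl documentation; the kernel, array names, and seed values are illustrative assumptions, and the exact ∇= keyword syntax should be checked against the current documentation.

    using ParallelStencil
    import Enzyme  # assumed to activate ParallelStencil's AD support
    @init_parallel_stencil(Threads, Float64, 1)

    # Simple in-place 1-D kernel (illustrative).
    @parallel_indices (ix) function f!(A, B, a)
        A[ix] += a * B[ix]
        return
    end

    N    = 16
    A, B = @rand(N), @rand(N)
    Ā, B̄ = @ones(N), @ones(N)   # shadow arrays seeded with ones; receive the derivatives

    @parallel f!(A, B, 6.5)                    # primal kernel launch
    @parallel ∇=(A->Ā, B->B̄) f!(A, B, 6.5)     # reverse-mode AD of the same launch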

We report the performance and scaling of benchmarks and real-world applications on LUMI, the EuroHPC flagship AMD GPU supercomputer, and on Piz Daint, the Nvidia GPU supercomputer at the Swiss National Supercomputing Centre. We show that near-optimal performance and scaling on thousands of GPUs of the world's fastest supercomputers are achievable.

Computational Scientist | Responsible for Julia computing
Swiss National Supercomputing Centre (CSCS), ETH Zurich
