JuliaCon 2026

From Stencils to XLA: A Reactant Backend for ParallelStencil.jl
2026-08-13, Room 3

We present a successful approach for building a Reactant backend for ParallelStencil, a Julia package for high-performance stencil computations. The approach includes the generation of kernel code and data structures that are pre-optimized to serve as optimal input for Reactant to generate efficient and correct GPU, TPU, and CPU code. We report the performance of representative stencil mini-apps on latest-generation hardware platforms, including NVIDIA H100 GPUs, AMD MI300A APUs, and Google TPUs, evaluate it in absolute terms, and compare it with the performance obtained with established backends that directly generate hardware-specific low-level code.


Stencil computations are a fundamental class of algorithms in scientific computing, with applications ranging from computational fluid dynamics to image processing. ParallelStencil is a Julia package that converts architecture-agnostic, high-level stencil or stencil-like code into high-performance GPU, TPU, and CPU code. To achieve this, two distinct approaches have been developed for building the corresponding backends: the first relies on generating hardware-specific low-level code (using CUDA.jl, AMDGPU.jl, Metal.jl, Polyester.jl, or Base.Threads); the second relies on generating generic code that is pre-optimized as input for Reactant, delegating the hardware-specific code generation to Reactant. Reactant is a Julia package that optimizes Julia functions with MLIR and XLA for high-performance execution on CPUs, GPUs, TPUs, and other hardware architectures.
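To make the notion of architecture-agnostic stencil code concrete, here is a minimal sketch in plain Base Julia of the kind of update such code expresses, a single 2D heat-diffusion step. This is not ParallelStencil's actual API (which expresses such kernels with macros like `@parallel`); it only illustrates the underlying stencil pattern.

```julia
# Hedged sketch (plain Base Julia, not ParallelStencil's macro API): one explicit
# 2D heat-diffusion step using a 5-point Laplacian stencil on interior points.
function diffusion_step!(Tnew, T, λ, dt, dx, dy)
    nx, ny = size(T)
    for j in 2:ny-1, i in 2:nx-1
        # 5-point Laplacian; boundary values are held fixed
        Tnew[i, j] = T[i, j] + λ * dt * (
            (T[i-1, j] - 2T[i, j] + T[i+1, j]) / dx^2 +
            (T[i, j-1] - 2T[i, j] + T[i, j+1]) / dy^2)
    end
    return Tnew
end

T    = zeros(16, 16); T[8, 8] = 1.0   # point heat source in the interior
Tnew = copy(T)
diffusion_step!(Tnew, T, 1.0, 0.1, 1.0, 1.0)
# Heat spreads from (8,8) to its four neighbors; the total is conserved.
```

A backend's job is then to map the same loop body onto GPU threads, TPU programs, or multithreaded CPU code without the user changing the kernel.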

This contribution focuses on the second approach. We describe solutions to challenges encountered in generating kernel code and data structures that are pre-optimized to serve as optimal input for Reactant to generate efficient and correct GPU, TPU, and CPU code. This includes the automatic injection of tracing macros to ensure that control flow is captured correctly, as well as the generation of data structures that are compatible with Reactant's requirements. We also discuss how different code patterns and data structures affect the efficiency of the generated code.
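The need for tracing macros can be illustrated with a toy model. The sketch below (a deliberately simplified stand-in, not Reactant's actual tracing machinery) records operations via operator overloading; because a plain Julia `if` executes eagerly at trace time, only the branch taken during tracing ends up in the recorded program, which is why control flow must be marked explicitly for the tracer.

```julia
# Hedged sketch: a minimal operator-overloading tracer that records operations.
struct Tracer
    ops::Vector{String}
end
import Base: +, *
+(a::Tracer, b::Number) = (push!(a.ops, "add $b"); a)
*(a::Tracer, b::Number) = (push!(a.ops, "mul $b"); a)

function kernel(x, flag::Bool)
    # The `if` runs eagerly during tracing: only one branch is ever recorded,
    # even though `flag` may differ at run time.
    if flag
        x = x * 2
    else
        x = x + 1
    end
    return x
end

t = Tracer(String[])
kernel(t, true)
t.ops   # only ["mul 2"]; the else-branch never enters the trace
```

Capturing such branches as control flow in the generated program, rather than baking in one path, is what the automatically injected tracing macros ensure.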

We report the performance of representative stencil mini-apps with the Reactant backend on latest-generation hardware platforms, including NVIDIA H100 GPUs and AMD MI300A APUs at the Swiss National Supercomputing Centre (CSCS), and Google TPUs at Google. The mini-apps include a 3D heat diffusion solver and a 3D Navier-Stokes solver on a staggered grid. We evaluate performance in absolute terms using the effective memory throughput metric, and in relative terms by comparing it with the performance obtained with the backends that directly generate hardware-specific low-level code.
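For readers unfamiliar with the metric, the effective memory throughput can be sketched as follows. The formulation below follows the definition commonly used with ParallelStencil (T_eff = A_eff / t_it, counting only the traffic an ideal implementation must perform: one read and one write per unknown field, plus one read per input-only field); the concrete numbers in the example are hypothetical.

```julia
# Hedged sketch of the effective memory throughput metric T_eff = A_eff / t_it,
# where A_eff counts only the minimally required memory traffic per iteration.
function T_eff(nx, ny, nz, n_unknown, n_input_only, t_it; nbytes = sizeof(Float64))
    A_eff = (2n_unknown + n_input_only) * nx * ny * nz * nbytes   # bytes moved
    return A_eff / t_it / 1e9                                     # GB/s
end

# Hypothetical example: 512^3 heat diffusion with one unknown field (read and
# written each iteration), no input-only fields, one iteration in 2 ms.
T_eff(512, 512, 512, 1, 0, 2e-3)   # ≈ 1074 GB/s
```

Because A_eff excludes redundant traffic, T_eff compares directly against the hardware's peak memory bandwidth, which is what makes it a meaningful absolute measure for memory-bound stencil codes.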

These results demonstrate the effectiveness of the approach used to build a backend with Reactant and provide insights into how to optimize code and data structures for efficient code generation with Reactant. Furthermore, this work shows how the powerful emerging MLIR- and XLA-based technologies can be integrated into existing domain-scientist-oriented high-performance computing frameworks such as ParallelStencil, providing a path toward broader adoption of these technologies in the HPC community.

Computational Scientist, responsible for Julia computing, at the Swiss National Supercomputing Centre (CSCS), ETH Zurich

William (Billy) Moses is an Assistant Professor at the University of Illinois in the Computer Science and Electrical and Computer Engineering departments. He received a Ph.D. in Computer Science from MIT, where he also received his M.Eng. in electrical engineering and computer science (EECS) and B.S. in EECS and physics. William's research involves creating compilers and program representations that enable performance and use-case portability, thus enabling non-experts to leverage the latest in high-performance computing and ML. He is known as the lead developer of Enzyme, a tool for LLVM/MLIR capable of differentiating code in a variety of languages; Polygeist, a polyhedral compiler and C++ frontend for MLIR; and Reactant, a tool for enabling existing scientific code to run on distributed ML accelerators. He has also worked on the Tensor Comprehensions framework for synthesizing high-performance GPU kernels of ML code, the Tapir compiler for parallel programs, and compilers that use machine learning to optimize code more effectively. He is a recipient of the 2026 SIAM Supercomputing Early Career Prize, the 2024 SIGHPC Doctoral Dissertation Award, a DOE Computational Science Graduate Fellowship, and the Karl Taylor Compton Prize, MIT's highest student award.