2021-07-30, 20:10–20:20 (UTC), Blue
In this talk, we will present our work coupling the Julia runtime with E3SM, an advanced earth system simulation application for supercomputers, and running E3SM with swappable in-situ Julia modules at large scale. The talk covers (1) coupling the Julia runtime with a legacy High-Performance Computing (HPC) application (E3SM), (2) the design of two in-situ data analysis modules in Julia, and (3) the communication design between E3SM and the in-situ Julia modules.
The Energy Exascale Earth System Model (E3SM) is the Department of Energy's state-of-the-art earth system simulation model. It aims to address the most critical and challenging climate problems by efficiently utilizing DOE's advanced HPC systems. One challenge for E3SM (and other exascale simulations) is the imbalance between the enormous volume of generated simulation data and the limited storage capacity. This means that post hoc data analysis needs to be replaced with in-situ analysis, which analyzes simulation data while the simulation is running. Our work aims to use Julia to provide data scientists with a high-level, performant interface for developing in-situ data analysis algorithms without directly interacting with complex HPC codes. This talk discusses (1) high-level Julia runtime coupling with E3SM, (2) two in-situ data analysis modules in Julia, and (3) low-level communication between E3SM and the in-situ Julia modules.
In this project, we focus on the Community Atmosphere Model (CAM), one of E3SM's coupled components, which models the atmosphere. Our goal is to study extreme weather events that occur in the atmosphere, such as sudden stratospheric warmings (SSWs) that can destabilize the polar vortex and cause extreme cold temperatures at the Earth's surface. The primary design consideration in coupling Julia with E3SM is identifying an appropriate entry point in CAM for calling the in-situ Julia modules. CAM is implemented in Fortran and advances the simulation in discrete timesteps. Its control module has access to the simulation data and was selected to interface with the Julia runtime. To couple E3SM with Julia, (1) we implemented a Fortran-based in-situ data adapter in the control module of CAM, which takes the CAM simulation data as input and internally passes it to the Julia runtime. (2) We implemented a C-based interface between the in-situ data adapter and the in-situ Julia modules. The C interface includes three major functions: initialization, worker, and cleanup, which respectively create an in-situ Julia instance (by loading and initializing the in-situ Julia module from a specified path), pass data from the in-situ adapter to that instance, and destroy the instance. The Fortran in-situ adapter calls the worker function at every timestep and the initialization/cleanup functions at the first/last timestep. (3) As E3SM mixes GNU Make and CMake to combine and compile its components, we added the Julia compilation flags for the C and Fortran interfaces to the CAM CMake file (i.e., header files) and the top-level GNU Make file (i.e., Julia libraries). The in-situ Julia module itself is compiled only when it is first invoked at runtime, so changing the in-situ Julia module does not require recompiling all of E3SM.
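To make the initialization/worker/cleanup lifecycle concrete, here is a minimal sketch of what an in-situ Julia module with those three entry points could look like. This is an illustrative toy, not E3SM's actual module: the module name, function names, and the running-mean analysis are all assumptions, and the real adapter passes CAM fields rather than plain vectors.

```julia
# Hypothetical in-situ module shaped like the talk's three-function C
# interface: initialize! at the first timestep, worker! at every
# timestep, cleanup! at the last timestep. All names are illustrative.
module InSituModule

const state = Dict{Symbol,Any}()  # per-instance state kept across timesteps

# Called once at the first timestep: allocate running state.
function initialize!(nvars::Int)
    state[:nsteps] = 0
    state[:sums]   = zeros(Float64, nvars)
    return nothing
end

# Called at every timestep with the rank-local data block, delivered as
# a flat 1D array (as in the C interface described above).
function worker!(data::Vector{Float64}, nvars::Int)
    block = reshape(data, :, nvars)           # columns = variables
    state[:sums] .+= vec(sum(block; dims=1))  # accumulate per-variable sums
    state[:nsteps] += 1
    return nothing
end

# Called once at the last timestep: finalize and release state.
function cleanup!()
    means = state[:sums] ./ state[:nsteps]
    empty!(state)
    return means   # time-mean of each variable over the run
end

end # module
```

Keeping the state inside the module mirrors the design in which each MPI rank owns its own Julia instance, so no cross-rank state is touched here.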
We have implemented two in-situ data analysis modules: linear regression and SSW detection. The linear regression module models simulation variables as functions of simulation time; it can be used to track trends in variables of interest and to identify important checkpoints in the simulation. The SSW module characterizes midwinter sudden stratospheric warmings, which often split the stratospheric polar vortex. By definition, an SSW occurs when the zonal mean of the zonal wind at 60°N and 10 hPa reverses (becomes easterly) and stays reversed for at least 10 consecutive days. Such events can lead to extreme surface temperatures in North America.
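The two analyses can be sketched on plain arrays as follows. These are simplified stand-ins for the actual modules, which operate on distributed CAM fields; the function names and signatures are illustrative assumptions.

```julia
# Illustrative least-squares fit of a variable y against simulation
# time t, returning (slope, intercept); the slope tracks the trend.
function linear_trend(t::Vector{Float64}, y::Vector{Float64})
    tm = sum(t) / length(t)
    ym = sum(y) / length(y)
    slope = sum((t .- tm) .* (y .- ym)) / sum((t .- tm) .^ 2)
    return slope, ym - slope * tm
end

# SSW criterion from the definition above: given daily values of the
# zonal-mean zonal wind at 60°N / 10 hPa, flag an SSW when the wind is
# easterly (negative) for at least `min_days` consecutive days.
function is_ssw(u_daily::Vector{Float64}; min_days::Int = 10)
    run = 0                       # length of the current easterly streak
    for u in u_daily
        run = u < 0 ? run + 1 : 0
        run >= min_days && return true
    end
    return false
end
```

In the in-situ setting, `u_daily` would be assembled across ranks at each timestep before the consecutive-day check is applied.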
The worker function in the C interface supports efficient low-level data communication between E3SM and the in-situ Julia modules. To run at large scale, E3SM uses the Message Passing Interface (MPI), so the data is distributed among the MPI ranks. Each MPI rank of CAM has access only to a local block of the CAM variables (e.g., velocity and temperature) and passes that block, as a 1D array, to its own in-situ Julia instance through the C interface. When an in-situ Julia instance needs remote data from other instances (e.g., to compute the SSW criterion), MPI.jl is used to implement the communication between them. A key design challenge is ensuring that E3SM and Julia use the same MPI communicator for correct communication; this is difficult because the current Julia C embedding cannot pass an MPI communicator directly. To address this, we developed two MPI communicator converters (one in C and one in Fortran) to support different MPI libraries. Finally, we evaluated the performance (i.e., the overall overhead) of the worker function to provide practical guidelines for running HPC applications with Julia at large scale.
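The remote-data step inside a Julia instance might look roughly like the sketch below. This is a hedged illustration under two assumptions: `MPI.COMM_WORLD` stands in for the communicator handed over from E3SM through the converters described above, and a random vector stands in for a rank-local block of the zonal wind; the reduction pattern (local partial sums combined with an `Allreduce`) is the part being illustrated.

```julia
# Sketch of assembling a zonal mean across ranks with MPI.jl.
# Assumption: in E3SM, `comm` would be the converted simulation
# communicator, not MPI.COMM_WORLD.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

local_block = rand(4)                 # stand-in for this rank's wind data
local_sum   = sum(local_block)
local_count = length(local_block)

# Combine partial sums and counts from all ranks, then form the mean.
global_sum   = MPI.Allreduce(local_sum, +, comm)
global_count = MPI.Allreduce(local_count, +, comm)
zonal_mean   = global_sum / global_count

rank == 0 && println("zonal-mean value ≈ ", zonal_mean)
MPI.Finalize()
```

Because every Julia instance issues the same collective on the same communicator as the simulation, the communicator-conversion step in the C/Fortran converters is what keeps these calls consistent with E3SM's own MPI traffic.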
Dr. Tang is a research scientist on the CCS-7 Programming Models team at Los Alamos National Laboratory. His research interests include programming models, co-design, and performance/energy analysis and modeling. He received his Ph.D. from the Department of Computer Science and Engineering at the University of Notre Dame in 2017.
Group leader for Statistical Sciences at Los Alamos National Laboratory