Exascale simulations, like those of interest to the Department of Energy, will produce more data than can be stored for post hoc statistical analysis. To address this, we are using Julia as an HPC-embedded language to develop workflows for statistical analysis inside simulations as they run. Our project, PRISM, is being developed to fit Bayesian hierarchical models that answer scientific questions using the high-resolution data the next generation of supercomputers will produce.
High-performance computing (HPC) has enabled predictive simulation models that have revolutionized physics, chemistry, biology, and other sciences. Modern supercomputers with exascale performance (one billion billion calculations per second) will create massive new quantities of simulation data. Saving these data has become a computational bottleneck: because storage is far slower than processors, only a tiny fraction of the data generated by an exascale simulation can be written to disk for later statistical analysis. Larger supercomputers will simulate physical systems in unprecedented detail, opening new avenues for scientific advancement. But as computational speed increases, a shrinking fraction of the simulation output can be analyzed, limiting the scientific gains that can be realized.
In response to this barrier, “in-situ” approaches have emerged to analyze data within the simulation itself, as the data are being generated, without needing to first save them to disk. Most of these approaches focus on compressing data by extracting scientifically informative features, such as the locations of interesting events within the simulation. However, they do not address the broader problem of inference: fitting statistical models, such as large-scale regressions, to spatial and time series data. A large body of advanced statistical analysis techniques therefore cannot be applied to exascale simulation data.
We have begun developing the algorithms and software tools needed to fit sophisticated statistical models in situ. We are using Julia to implement new statistical algorithms that can analyze scientific data efficiently on exascale supercomputers, dealing with the unique computational challenges that in-situ analysis presents. These challenges include distributing the calculation of data correlations across many computational nodes while minimizing slow inter-node communication; analyzing data one piece at a time when they are too large to hold in memory at once; and constructing faster model-fitting algorithms so that analysis does not slow the simulation down. We are applying these new techniques to model extreme events in climate and space weather simulations.
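To make the first two challenges concrete, the sketch below (a minimal illustration using MPI.jl, not PRISM code) has each node reduce its local block of simulation output to a few sufficient statistics, which a single collective operation then pools into a global mean and covariance. No raw data ever crosses node boundaries, and the same statistics can be accumulated one timestep at a time when the data do not fit in memory.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Each rank holds a local block of simulation output:
# n_local observations of d variables (stand-in data here).
d, n_local = 4, 10_000
X = randn(n_local, d)

# Reduce the block to sufficient statistics; these can also be
# accumulated incrementally, one timestep at a time.
local_n  = Float64(n_local)
local_s  = vec(sum(X; dims = 1))   # d-vector of column sums
local_ss = X' * X                  # d × d cross-product matrix

# One collective per statistic; no raw data leaves the node.
n  = MPI.Allreduce(local_n, +, comm)
s  = MPI.Allreduce(local_s, +, comm)
ss = MPI.Allreduce(local_ss, +, comm)

# Global mean and (maximum-likelihood) covariance from the pooled statistics.
μ = s ./ n
Σ = ss ./ n .- μ * μ'

MPI.Finalize()
```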
Our project’s primary deliverable is the Programming Repository for In Situ Modeling (PRISM). PRISM is a set of tools for fitting statistical and machine learning models to simulation data inside the simulations as they run. The tools are designed to support a wide variety of data analyses, with an emphasis on spatiotemporal hierarchical Bayesian models. PRISM is intended to be efficient, scalable, and streaming, with estimation based on variational inference, advanced Monte Carlo techniques, and fast optimization methods. The core modeling components support this goal by imposing sparsity and using approximate inference wherever possible. All of these components are written in Julia, enabling data scientists to work in a high-level programming language that is performant enough to keep up with the C/C++ and Fortran codes it is embedded within. PRISM also contains tools for interfacing with large-scale scientific simulations written in Fortran and C/C++. This layer of abstraction allows the data scientist to construct analysis models in Julia without concern for the implementation details of the underlying simulation. With these components, PRISM can unlock the full scientific potential of next-generation HPC simulations.
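As a rough illustration of the interface idea (the function name, signature, and global below are hypothetical, not the PRISM API), the host simulation can invoke an exported Julia entry point each timestep, handing over a pointer to its field data, which Julia wraps without copying:

```julia
# Hypothetical sketch of an in-situ analysis entry point. The C/Fortran host
# calls `analyze_step` each timestep via Julia's embedding/compilation tools.
const STEP_MEANS = Float64[]   # per-step summaries retained on the Julia side

Base.@ccallable function analyze_step(field::Ptr{Cdouble},
                                      nx::Cint, ny::Cint)::Cvoid
    # Zero-copy view of the simulation's in-memory array
    # (column-major, matching Fortran layout).
    A = unsafe_wrap(Array, field, (Int(nx), Int(ny)))
    # Stand-in for real in-situ analysis of the field.
    push!(STEP_MEANS, sum(A) / length(A))
    return nothing
end
```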
Thus far, we have implemented methods for distributed sparse Gaussian processes, Gaussian mixture modeling for clustering, and extreme value modeling. We have also implemented fast estimation techniques based on sequential variational inference (an approximation scheme for estimating Bayesian posterior distributions) and deep neural networks that optimize the parameters of Gaussian process models. This work also includes domain-specific analyses: detecting sudden stratospheric warming in a climate simulation and identifying fast plasma flow channels in a space weather simulation. These techniques have been demonstrated on high-performance computing hardware at Los Alamos National Laboratory.
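For a flavor of the variational approach, the following textbook sketch (plain Julia, not PRISM code) runs coordinate-ascent variational inference on a univariate Normal-Gamma model, alternating two closed-form conjugate updates to approximate the posterior over a mean μ and precision τ; the sequential, distributed schemes used in practice build on the same idea.

```julia
# Toy coordinate-ascent variational inference (CAVI).
# Model: x_i ~ N(μ, 1/τ),  μ|τ ~ N(μ0, 1/(λ0 τ)),  τ ~ Gamma(a0, b0)
# Factorized approximation: q(μ, τ) = N(μ | m, 1/λ) * Gamma(τ | a, b)
function cavi_normal_gamma(x; μ0 = 0.0, λ0 = 1.0, a0 = 1.0, b0 = 1.0,
                           iters = 50)
    n, x̄ = length(x), sum(x) / length(x)
    m = (λ0 * μ0 + n * x̄) / (λ0 + n)   # q(μ) mean has a closed form
    a = a0 + (n + 1) / 2                # q(τ) shape is fixed by n
    λ, b = λ0, b0
    for _ in 1:iters                    # alternate the two conjugate updates
        Eτ = a / b                      # current expected precision
        λ  = (λ0 + n) * Eτ
        b  = b0 + 0.5 * (sum(abs2, x .- m) + n / λ +
                         λ0 * ((m - μ0)^2 + 1 / λ))
    end
    return (m = m, λ = λ, a = a, b = b) # parameters of q(μ) and q(τ)
end

# Synthetic data with mean 2 and standard deviation 0.5:
q = cavi_normal_gamma(randn(1_000) .* 0.5 .+ 2.0)
# q.m ≈ 2 (posterior mean of μ); q.a / q.b ≈ 4 (posterior mean of τ = 1/0.25)
```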