2021-07-30, Red
We present Redwood, a Julia framework for clusterless supercomputing in the cloud. Redwood provides a set of distributed programming macros that enable users to remotely execute Julia functions in parallel through cloud services for batch and serverless computing. We present the architecture and design of Redwood, as well as its application to existing Julia packages for machine learning and inverse problems.
With the rise in popularity of deep learning and large-scale numerical simulations, high-performance computing (HPC) has entered the mainstream of scientific computing. Today, HPC techniques are required by an increasingly broad audience in fields such as machine and deep learning, weather forecasting, medical and seismic imaging, computational genomics, and fluid dynamics. HPC workloads have traditionally been deployed to on-premise high-performance computing clusters and were therefore only available to a limited number of researchers and corporations. With the rise of cloud computing, HPC resources have in principle become available to a much wider audience, but managing HPC infrastructure in the cloud is challenging. As the cloud provides a fundamentally different computing infrastructure than on-premise supercomputers, users need to build environments and applications that are resilient and cost efficient, and that are able to leverage cloud-specific opportunities such as elastic (hyper-scale) compute and heterogeneous infrastructure.
Naturally, the current approach to porting HPC applications to the cloud is to replicate the infrastructure of on-premise supercomputing centers with cloud resources. Cloud services such as AWS ParallelCluster or Azure CycleCloud enable users to create virtual HPC clusters consisting of login nodes, job schedulers, a set of compute instances, networking, and distributed storage systems. Even cloud-native approaches such as Kubernetes follow this cluster-based architecture, albeit with containerization and novel schedulers. From the user's perspective, however, both approaches involve a two-step process in which users first create a (virtual) HPC cluster in the cloud and then submit their parallel program to it. This makes running HPC applications in the cloud challenging, as users have to act as cluster administrators who manage the HPC infrastructure before they can run their application.
In this work, we make the case for clusterless supercomputing in the cloud, in which the user application essentially takes over the role of the job scheduler and cluster orchestrator. Instead of a two-step process in which users first create a cluster and then submit their job to it, the application is executed anywhere and dynamically manages the required compute infrastructure at runtime. To enable this type of clusterless HPC, which is heavily inspired by serverless orchestration frameworks, we introduce Redwood, an open-source software package for clusterless supercomputing on the Azure cloud. Redwood provides a set of distributed programming macros that follow the same principles as Julia's existing macros for distributed computing, namely remote function calls and futures. Unlike Julia's standard distributed computing framework, Redwood does not require a parallel Julia session running on a set of interconnected nodes (i.e., a cluster). Instead, Redwood executes functions that are tagged for remote (parallel) execution via cloud services such as Azure Batch or Azure Functions by creating a closure around the executed code and running it remotely through the respective cloud service. Results, namely function outputs, are written to cloud object stores and remote references are returned to the user.
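To make the programming model concrete, the following is a minimal sketch of what remote execution with futures could look like in this style. The package import and the macro names (`@batchdef`, `@batchexec`) are illustrative assumptions for this abstract, not the confirmed Redwood API:

```julia
# Hedged sketch of clusterless remote execution in the spirit of Redwood.
# The package name and the macros @batchdef/@batchexec are assumptions
# for illustration only.
using Redwood   # assumed package name

# Tag a function so that its definition is shipped to the remote workers
@batchdef function solve_block(i)
    A = randn(1000, 1000)      # some compute-heavy kernel
    return sum(A * A')
end

# Execute the call remotely through a cloud batch service; returns a future
bfuture = @batchexec solve_block(1)

# The remote output is written to a cloud object store; fetch blocks until
# the result is available and downloads it
result = fetch(bfuture)
```

In contrast to `remotecall`/`fetch` from the Distributed standard library, no worker pool has to exist before the call; in this model, the macro is responsible for provisioning (or reusing) the required cloud resources at runtime.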
In this talk, we discuss the architecture and implementation of Redwood and present HPC scenarios that are enable by it. This includes large-scale MapReduce workloads, computations that are distributed across multiple data centers or even regions, as well as combinations of data and model parallel applications in which users can execute multiple distributed-memory MPI workloads in parallel. Additionally, we present how existing Julia packages such as Flux or JUDI (a framework for PDE-constrained optimization) can be cloud-natively deployed through Redwood, without requiring users to set up HPC clusters.
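As an illustration of the MapReduce scenario, a hedged sketch under the same assumed macro names might map one remote task per data block and reduce the fetched partial results locally:

```julia
# Hedged MapReduce-style sketch; @batchdef/@batchexec are the same assumed
# macros as above (assumes `using Redwood`) and do not necessarily match
# the actual Redwood interface.
@batchdef function partial_misfit(block_id)
    # Placeholder for an expensive per-block computation, e.g., evaluating
    # one term of a PDE-constrained objective for a single data block
    return float(block_id)^2
end

# Map phase: launch one remote task per data block
futures = [@batchexec(partial_misfit(i)) for i in 1:100]

# Reduce phase: fetch the partial results from object storage and sum them
total = reduce(+, fetch.(futures))
```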
Philipp A. Witte is a researcher at Microsoft Research for Industry (RFI), a new initiative within Microsoft for developing innovative research solutions for industry-related problems ranging from AI/ML to edge- and high-performance computing. Prior to Microsoft, Philipp received his B.Sc. and M.Sc. in Geophysics from the University of Hamburg and his Ph.D. in Computational Science and Engineering from the Georgia Institute of Technology. During his Ph.D., Philipp worked with Professor Felix J. Herrmann at the Seismic Laboratory for Imaging and Modeling (SLIM) on computational aspects of least squares seismic imaging and full-waveform inversion. He has authored and contributed to multiple open-source software packages, including Devito, the Julia Devito Inversion framework (JUDI) and InvertibleNetworks.jl, a Julia framework for deep learning with normalizing flows.