2025-04-24 –, Hassium
Many data and simulation scientists use NumPy for its ease of use and good performance on CPU. This approach works well for single-node tasks, but scaling to handle larger datasets or more resource-intensive computations introduces significant challenges. Not to mention, using GPUs requires another level of complexity. We present the cuPyNumeric library, which gives developers the same familiar NumPy interface, but seamlessly distributes work across CPUs and GPUs.
In this talk we showcase the productivity and performance of cuPyNumeric library on one of the user's examples covering some detail on its implementation.
Many data and simulation scientists use NumPy for its ease of use and good performance on CPU. This approach works well for single-node tasks, but scaling to handle larger datasets or more resource-intensive computations introduces significant challenges. Not to mention, using GPUs requires another level of complexity. We present the cuPyNumeric library. cuPyNumeric gives developers the same familiar NumPy interface, but seamlessly distributes work across CPUs and GPUs.
A compelling example when scaling is necessary is when scientists at the Stanford Linear Accelerator Center(SLAC) need to process a large amount of data within a fixed time window, called beam time. The full dataset generated during experiments is too large to be processed on a single CPU. Additionally, the code often must be modified during the beam time to adapt to changing experimental needs. Being able to use NumPy syntax rather than lower level distributed computing libraries makes these changes quick and easy, allowing researchers to focus on conducting more experiments rather than debugging or optimizing code.
cuPyNumeric is designed to be a drop-in replacement to NumPy. Built on top of task-based distributed runtime from Stanford University, it automatically parallelizes NumPy APIs across all available resources, taking care of data distribution, communication, asynchronous and accelerated execution of compute kernels on both GPUs or multi-core CPUs. In addition, cuPyNumeric can be integrated with other popular Python libraries like SciPy, matplotlib, Jax. With cuPyNumeric, SLAC scientists successfully ran their data processing code distributed across multiple nodes and GPUs, processing the full dataset with a 6x speed-up compared to the original single-node implementation.
In this talk we showcase the productivity and performance of cuPyNumeric library covering some detail on its implementation.
Novice
Expected audience expertise: Python:None
Public link to supporting material, e.g. videos, Github, etc.:Irina Demeshko is a senior software engineer at NVIDIA working on cuNumeric and Legate projects. Before NVIDIA, Irina was a research scientist and team leader of the Co-Design team at the Los Alamos National Laboratory. Her work and research interests are in the area of new HPC technologies and programming models. Irina received her Ph.D. in mathematical and computer science from the Tokyo Institute of Technology in 2013.
Quynh L. Nguyen is an Associate Scientist at SLAC National Accelerator Laboratory. She completed her PhD in Physics at JILA-University of Colorado Boulder (2020) and postdoctoral work at SLAC National Accelerator Laboratory and Stanford University as a Q-FARM Bloch Fellow. Her research interests include nonequilibrium dynamics of materials and high performance computing with applications towards quantum information science.