Outgrowing your node? Zero stress scaling with cuPyNumeric. PyCon DE & PyData 2025

Outgrowing your node? Zero stress scaling with cuPyNumeric.

2025-04-24 10:55–11:25, Hassium

Many data and simulation scientists use NumPy for its ease of use and good CPU performance. This approach works well for single-node tasks, but scaling to larger datasets or more resource-intensive computations introduces significant challenges, and using GPUs adds another level of complexity. We present the cuPyNumeric library, which gives developers the same familiar NumPy interface while seamlessly distributing work across CPUs and GPUs.
In this talk, we showcase the productivity and performance of the cuPyNumeric library on a real user example and cover some details of its implementation.

A compelling example of when scaling becomes necessary comes from scientists at the Stanford Linear Accelerator Center (SLAC), who need to process a large amount of data within a fixed time window called beam time. The full dataset generated during experiments is too large to be processed on a single CPU. Additionally, the code often must be modified during beam time to adapt to changing experimental needs. Being able to use NumPy syntax rather than lower-level distributed computing libraries makes these changes quick and easy, allowing researchers to focus on conducting more experiments rather than debugging or optimizing code.

cuPyNumeric is designed to be a drop-in replacement for NumPy. Built on top of a task-based distributed runtime from Stanford University, it automatically parallelizes NumPy APIs across all available resources, taking care of data distribution, communication, and asynchronous, accelerated execution of compute kernels on both GPUs and multi-core CPUs. In addition, cuPyNumeric can be integrated with other popular Python libraries such as SciPy, Matplotlib, and JAX. With cuPyNumeric, SLAC scientists successfully ran their data processing code distributed across multiple nodes and GPUs, processing the full dataset with a 6x speed-up compared to the original single-node implementation.
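To make the drop-in idea concrete, here is a minimal sketch of what swapping the import might look like. The Jacobi relaxation kernel, the grid size, and the iteration count are illustrative assumptions and are not taken from the SLAC workload; the fallback to NumPy is only there so the snippet also runs on a machine without cuPyNumeric installed.

```python
# Assumption: cuPyNumeric is importable as `cupynumeric`; if it is not
# installed, fall back to plain NumPy so the same code still runs.
try:
    import cupynumeric as np
except ImportError:
    import numpy as np


def jacobi_step(grid):
    """One Jacobi relaxation sweep over the interior of a 2-D grid.

    Hypothetical example kernel: each interior cell becomes the average
    of its four neighbours; boundary cells are left unchanged.
    """
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
    )
    return new


# Illustrative problem: a 64x64 plate with the top edge held at 1.0.
grid = np.zeros((64, 64))
grid[0, :] = 1.0

for _ in range(100):
    grid = jacobi_step(grid)

# Mean interior value after 100 sweeps.
print(float(grid[1:-1, 1:-1].mean()))
```

Only the import line differs from an ordinary NumPy script; the array code itself is unchanged. How the script is launched across multiple GPUs or nodes depends on the installation, so the project documentation is the place to check for launcher options.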

In this talk, we showcase the productivity and performance of the cuPyNumeric library and cover some details of its implementation.

Expected audience expertise: Domain:

Novice

Expected audience expertise: Python:

None

Public link to supporting material, e.g. videos, Github, etc.:

https://github.com/nv-legate/cupynumeric

Bo Dong

Bo Dong is a Principal Technical Product Manager on the CUDA team. He is responsible for distributed computing including Legate and other technologies/products at NVIDIA.
