PyData Boston 2025

Accelerating Geospatial Analysis with GPUs
2025-12-10, Abigail Adams

Geospatial analysis often relies on raster data: n-dimensional arrays in which each cell holds a spatial measurement. Many raster operations, such as computing indices, statistical analysis, and classification, are naturally parallelizable and therefore ideal candidates for GPU acceleration.

This talk demonstrates an end‑to‑end GPU‑accelerated semantic segmentation pipeline for classifying satellite imagery into multiple land cover types. Starting with cloud-hosted imagery, we will process data in chunks, compute features, train a machine learning model, and run large-scale predictions. Each stage is accelerated with open-source tools from the RAPIDS ecosystem, such as cuML, working alongside Xarray and Dask, and often requires only minor changes to familiar data science workflows.

Attendees who work with raster data or other parallelizable, computationally intensive workflows will benefit most from this talk, which focuses on GPU acceleration techniques. While the talk draws from geospatial analysis, key geospatial concepts will be introduced for beginners. The methods demonstrated can be applied broadly across domains to accelerate large-scale data processing.


This informative, example-driven talk is ideal for developers and data scientists who work closely with geospatial data. Prior experience is not required; we will cover the essential concepts so everyone can follow along. The talk focuses on GPU acceleration and the performance gains available when working with n-dimensional arrays, and demonstrates how attendees can apply the same techniques to their own workflows.
All demo code will be made available in a GitHub repository.
Outline and time breakdown:

  • Introduction (5 mins)

      ◦ An introduction to raster data, its structure, and land use/land cover (LULC) classification using semantic segmentation.

      ◦ An explanation of why raster operations are GPU-friendly.

      ◦ An introduction to the publicly available datasets used in this talk (Sentinel-2 and ESA WorldCover).

  • Overview of common raster operations (5 mins)

      ◦ Data ingestion from rasters stored in Cloud Optimized GeoTIFF (COG) format on the cloud directly into Zarr/Xarray (see the first sketch after this outline).

      ◦ Computing new features such as the vegetation index NDVI with Xarray + Dask + CuPy to stream chunks into GPU memory, often scaling to multi-GPU setups (sketched below).

  • Training a Random Forest model (10 mins)

      ◦ Using cuML to train a Random Forest model across millions of pixels, either by sampling pixels on a single GPU or by scaling to a multi-GPU setup with Dask-cuML (sketched below).

      ◦ Demonstrating tiled inference, with outputs written back to the cloud as COGs (sketched below).

  • Comparison against a CPU baseline and GPU tuning options (10 mins)

      ◦ Using scikit-learn as a CPU baseline to provide reference timings, enabling end-to-end wall-time comparisons for feature computation, training, and full-scene inference (sketched below).

      ◦ Best practices for GPU configuration, such as chunk sizing and memory pools (sketched below).

  • Limitations and alternatives (5 mins)

      ◦ Scenarios where GPUs will not give the best results, such as very small datasets or heavily I/O-bound workloads.

      ◦ Deep learning frameworks as an alternative to the Random Forest model used in this talk, with a comparison of the two approaches.

  • Q&A (5 mins)
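
The sketches below illustrate the kind of code the outline refers to; the full versions will live in the demo repository, and the names, URLs, chunk sizes, and hyperparameters here are illustrative assumptions rather than the final demo code. First, COG ingestion into a chunked Xarray object, assuming a placeholder URL that points at a real Sentinel-2 band stored as a COG (rioxarray is one of several libraries that can do this):

    import rioxarray

    # Placeholder URL; the talk streams Sentinel-2 bands hosted as COGs on
    # public cloud storage.
    COG_URL = "https://example.com/sentinel-2/B04.tif"

    # Open the raster lazily with Dask-backed chunks, so only the tiles that
    # are actually touched get read from object storage.
    red = rioxarray.open_rasterio(COG_URL, chunks={"x": 2048, "y": 2048})

    # Optionally persist the chunked scene as Zarr for faster repeated access.
    red.squeeze("band", drop=True).to_dataset(name="red").to_zarr("red.zarr", mode="w")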
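
A minimal NDVI computation with CuPy-backed Dask chunks. Synthetic arrays stand in for the real red and near-infrared bands so the snippet is self-contained; the map_blocks(cupy.asarray) step is what moves each chunk onto the GPU:

    import cupy as cp
    import dask.array as da
    import xarray as xr

    # Synthetic stand-ins for the red and near-infrared bands of a 10 m
    # Sentinel-2 scene; in the demo these come from the COGs opened above.
    shape, chunks = (10980, 10980), (2048, 2048)
    red = xr.DataArray(da.random.random(shape, chunks=chunks).astype("float32"), dims=("y", "x"))
    nir = xr.DataArray(da.random.random(shape, chunks=chunks).astype("float32"), dims=("y", "x"))

    # Move each Dask chunk onto the GPU as it is materialised.
    red_gpu = red.copy(data=red.data.map_blocks(cp.asarray))
    nir_gpu = nir.copy(data=nir.data.map_blocks(cp.asarray))

    # NDVI = (NIR - red) / (NIR + red); the element-wise arithmetic runs as
    # CuPy kernels one chunk at a time, so the whole scene never needs to fit
    # in GPU memory at once.
    ndvi = ((nir_gpu - red_gpu) / (nir_gpu + red_gpu)).compute()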
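
Training the classifier with cuML on a sampled pixel table. The random features and labels below are stand-ins for band values and ESA WorldCover classes; for samples that need more than one GPU, Dask-cuML offers an analogous distributed estimator:

    import cupy as cp
    from cuml.ensemble import RandomForestClassifier

    # Synthetic stand-in for the sampled pixel table: one row per pixel, one
    # column per band or derived index (e.g. NDVI); the integer labels play
    # the role of ESA WorldCover land cover classes.
    n_pixels, n_features, n_classes = 2_000_000, 6, 8
    X = cp.random.random((n_pixels, n_features), dtype=cp.float32)
    y = cp.random.randint(0, n_classes, size=n_pixels).astype(cp.int32)

    # cuML mirrors the scikit-learn estimator API; the inputs are already on
    # the GPU as CuPy arrays, so no host-to-device copy is needed at fit time.
    rf = RandomForestClassifier(n_estimators=100, max_depth=16)
    rf.fit(X, y)
    predictions = rf.predict(X[:100_000])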
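
Tiled inference can be expressed with xarray.apply_ufunc so that each Dask chunk is classified independently and the result is written out as a COG. The feature stack, labels, and CRS below are synthetic placeholders (the demo reuses the cuML model trained above), and a GDAL build with the COG driver is assumed:

    import numpy as np
    import rioxarray  # noqa: F401  (registers the .rio accessor)
    import xarray as xr
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-ins: a small (band, y, x) feature stack with placeholder
    # coordinates, and a model fitted on random labels.
    rng = np.random.default_rng(0)
    stack = xr.DataArray(
        rng.random((4, 512, 512), dtype=np.float32),
        dims=("band", "y", "x"),
        coords={"y": np.linspace(45.0, 44.9, 512), "x": np.linspace(5.0, 5.1, 512)},
    ).chunk({"y": 256, "x": 256})
    rf = RandomForestClassifier(n_estimators=10).fit(
        rng.random((1_000, 4), dtype=np.float32), rng.integers(0, 8, 1_000)
    )

    def predict_block(block, model):
        # Each chunk arrives as (y, x, band); flatten the spatial dims,
        # predict per pixel, then restore the tile shape.
        ny, nx, nbands = block.shape
        return model.predict(block.reshape(-1, nbands)).astype(np.uint8).reshape(ny, nx)

    classified = xr.apply_ufunc(
        predict_block,
        stack,
        kwargs={"model": rf},
        input_core_dims=[["band"]],
        dask="parallelized",
        output_dtypes=[np.uint8],
    )

    # Attach georeferencing and write the land cover map as a Cloud Optimized GeoTIFF.
    classified.rio.write_crs("EPSG:4326").compute().rio.to_raster("landcover.tif", driver="COG")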
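
For the CPU-versus-GPU comparison, a simple wall-time harness around identically configured estimators; the synthetic pixel table and hyperparameters are illustrative, and real timings will of course depend on hardware:

    from time import perf_counter

    import numpy as np
    from cuml.ensemble import RandomForestClassifier as CumlRF
    from sklearn.ensemble import RandomForestClassifier as SklearnRF

    # The same synthetic pixel table on the host, so both libraries see
    # identical inputs and hyperparameters.
    rng = np.random.default_rng(0)
    X = rng.random((1_000_000, 6), dtype=np.float32)
    y = rng.integers(0, 8, size=1_000_000).astype(np.int32)

    def timed_fit(model):
        start = perf_counter()
        model.fit(X, y)
        return perf_counter() - start

    gpu_seconds = timed_fit(CumlRF(n_estimators=100, max_depth=16))
    cpu_seconds = timed_fit(SklearnRF(n_estimators=100, max_depth=16, n_jobs=-1))
    print(f"cuML fit: {gpu_seconds:.1f} s, scikit-learn fit: {cpu_seconds:.1f} s")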
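
One example of the configuration pointers in the outline: pre-allocating an RMM memory pool and routing CuPy through it, plus a per-GPU pool when using Dask-CUDA. Pool sizes are placeholders and the allocator import path has moved between RMM releases, so treat this as a sketch:

    import cupy as cp
    import rmm
    from rmm.allocators.cupy import rmm_cupy_allocator
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Pre-allocate a reusable GPU memory pool so repeated chunk-sized
    # allocations avoid the cost of raw cudaMalloc/cudaFree calls.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=8 * 2**30)  # 8 GiB

    # Route CuPy allocations (and therefore CuPy-backed Dask chunks) through
    # the same pool.
    cp.cuda.set_allocator(rmm_cupy_allocator)

    # On multi-GPU machines, Dask-CUDA starts one worker per GPU, each with
    # its own RMM pool; chunk sizes of a few hundred megabytes generally keep
    # the GPUs busy without exhausting the pool.
    cluster = LocalCUDACluster(rmm_pool_size="8GB")
    client = Client(cluster)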


Prior Knowledge Expected: No previous knowledge expected

Jacob Tomlinson is a senior Python software engineer at NVIDIA with a focus on deployment tooling for distributed systems. His work involves maintaining open source projects including RAPIDS and Dask. RAPIDS is a suite of GPU-accelerated open source Python tools that mimic APIs from the PyData stack, including those of NumPy, pandas and scikit-learn. Dask provides advanced parallelism for analytics with out-of-core computation, lazy evaluation and distributed execution of the PyData stack. He also tinkers with the open source Kubernetes Python framework kr8s in his spare time. Jacob volunteers with the local tech community group Tech Exeter and lives in Exeter, UK.

Naty Clementi is a senior software engineer at NVIDIA. She is a former academic with a Master's in Physics and a PhD in Mechanical and Aerospace Engineering to her name. Her work involves contributing to RAPIDS, and in the past she has also contributed to and maintained other open source projects such as Ibis and Dask. She is an active member of PyLadies and an active volunteer and organizer of Women and Gender Expansive Coders DC meetups.

Jaya Venkatesh is a software engineer at NVIDIA, working on the RAPIDS ecosystem with a focus on simplifying deployment in the cloud and distributed systems. Previously, Jaya worked as a machine learning engineer at Pixxel Space, where he developed large-scale, real-time inferencing models for Earth Observation. He holds a Master's degree in Computer Science from Arizona State University, where his research project centered on snowmelt monitoring in the Arizona region through satellite imagery analysis.