PyLadiesCon 2024


The adoption of the Array API standard in scikit-learn and how to use it
2024-12-07, Main Stream
Language: English

This talk will explore the adoption of the Array API standard in scikit-learn. We'll begin with a brief introduction to the Array API standard and its benefits, focusing on its impact on cross-library functions and model training speed. Next, we'll examine a detailed example of a scikit-learn kernel that shows significant performance improvements due to the Array API. Currently, scikit-learn users opt in to Array API support by enabling a configuration flag called array_api_dispatch, which is off by default. We'll demonstrate how to turn on this flag during the talk and showcase the resulting performance changes. By the end of this talk, attendees will have a high-level understanding of the Array API standard and know how to use it in their own scikit-learn projects.


The talk will begin with a personal introduction, detailing my background, current role, and involvement with scikit-learn. I'll then provide a brief overview of the Array API standard and direct attendees to the official page for more information. Next, I'll explain the benefits of adopting the Array API standard, including GPU utilization with PyTorch or CuPy (as opposed to CPU-only NumPy), JIT compiler advantages, and distributed computing capabilities with Dask. However, I'll note that scikit-learn is currently focused primarily on GPU utilization.
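As a rough illustration of what "writing against the Array API standard" means, the sketch below looks up an array's namespace through the standard's `__array_namespace__` protocol and then calls only functions from that namespace. The helper names `get_namespace` and `standardize` are hypothetical and are not taken from the talk or from scikit-learn's internals. It is demonstrated with NumPy only; arrays from libraries that do not expose the protocol themselves would need a compatibility layer such as array-api-compat.

```python
import numpy as np


def get_namespace(x):
    """Return the array namespace of x (hypothetical helper for illustration)."""
    if hasattr(x, "__array_namespace__"):
        # Protocol defined by the Array API standard.
        return x.__array_namespace__()
    if isinstance(x, np.ndarray):
        # Assumption: older NumPy releases lack the protocol, so fall back to numpy.
        return np
    raise TypeError(f"No array namespace found for {type(x).__name__}")


def standardize(x):
    """Center each column and scale it to unit variance using only namespace calls."""
    xp = get_namespace(x)
    mean = xp.mean(x, axis=0)
    std = xp.std(x, axis=0)
    return (x - mean) / std


if __name__ == "__main__":
    data = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(standardize(data))
```

Because `standardize` never names NumPy directly, the same function body could in principle run on any array type whose namespace provides `mean` and `std`, which is the cross-library benefit described above.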

The core of this presentation will be a code example demonstrating the significant increase in processing speed when enabling the array_api_dispatch flag. I'll start by showing and explaining an estimator that supports Array API. Then, I'll train a model using this estimator without the array_api_dispatch flag (CPU training), using time.perf_counter to measure the fitting duration. Following this, I'll enable the flag and retrain the model (GPU training), again tracking the time. We'll compare the training times with and without the flag enabled, highlighting the substantial speed improvement with array_api_dispatch turned on.
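The comparison described above might look roughly like the following sketch. It is not the speaker's actual demo: the choice of `LinearDiscriminantAnalysis` (which scikit-learn documents as supporting Array API inputs), the synthetic dataset, its size, and the helper `time_fit` are all assumptions made for illustration. Running the second half requires PyTorch, and depending on the installed versions it may also need array-api-compat and the `SCIPY_ARRAY_API` environment variable set before SciPy is imported.

```python
import os
import time

# Assumption: some scikit-learn/SciPy versions need this set before SciPy is imported.
os.environ.setdefault("SCIPY_ARRAY_API", "1")


def time_fit(estimator, X, y):
    """Return the wall-clock seconds spent in estimator.fit(X, y)."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


def main():
    from sklearn import config_context
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Synthetic data; the sizes are arbitrary illustration values.
    X_np, y_np = make_classification(n_samples=20_000, n_features=200, random_state=0)

    # Baseline: NumPy inputs with the flag left at its default (off).
    cpu_seconds = time_fit(LinearDiscriminantAnalysis(), X_np, y_np)
    print(f"NumPy, flag off: {cpu_seconds:.3f} s")

    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; skipping the array_api_dispatch run.")
        return

    # Use the GPU when one is available, otherwise stay on the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    X_t = torch.asarray(X_np, device=device, dtype=torch.float32)
    y_t = torch.asarray(y_np, device=device, dtype=torch.float32)

    with config_context(array_api_dispatch=True):
        torch_seconds = time_fit(LinearDiscriminantAnalysis(), X_t, y_t)
    # Note: CUDA kernels run asynchronously, so careful GPU benchmarks usually
    # synchronize the device before reading the clock.
    print(f"PyTorch on {device}, flag on: {torch_seconds:.3f} s")


if __name__ == "__main__":
    main()
```

The measured speedup depends on the hardware, the dataset size, and the estimator, so the numbers printed by a sketch like this should be read as indicative only.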

To conclude, I'll showcase a meta issue in the scikit-learn GitHub repository that tracks Array API-related issues, providing attendees with a resource for further exploration and potential contribution.

Emily is a scikit-learn enthusiast at Probabl, an open source contributor, an advocate for diversity and inclusion in STEM, and an engineering student at the University of Toronto with an emphasis on biomedical applications.