EuroSciPy 2024

Skrub: prepping tables for machine learning
2024-08-28 , Room 7

When designing predictive machine learning models, data scientists reportedly spend over 80% of their time preparing the data that is fed to the learning algorithm.

Currently, no automated solution exists to address this problem. However, the skrub Python library alleviates some of the daily tasks of data scientists and integrates with the scikit-learn machine learning library.

In this talk, we provide an overview of the features available in skrub.

First, we focus on the preprocessing stage closest to the data sources. While predictive models usually expect a single design matrix and a target vector (or matrix), in practice the data are often spread across several tables. The entries to be matched may also differ slightly between tables, which makes an exact join difficult. We will present the skrub joiners, which handle such use cases and are fully compatible with scikit-learn and its pipeline.
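To make the joining problem concrete, here is a minimal sketch of an approximate join written with the Python standard library only. It is not skrub's implementation or API: the function name `fuzzy_left_join`, the similarity measure (`difflib.SequenceMatcher`), the 0.6 threshold, and the toy tables are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(left, right, left_key, right_key, threshold=0.6):
    """Left-join two lists of dicts on approximately equal string keys.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    joined = []
    for row in left:
        best_match, best_score = None, 0.0
        for candidate in right:
            # Similarity in [0, 1] between the two lower-cased key strings.
            score = SequenceMatcher(
                None, row[left_key].lower(), candidate[right_key].lower()
            ).ratio()
            if score > best_score:
                best_match, best_score = candidate, score
        merged = dict(row)
        # Only attach the right-hand columns when the best match is close enough.
        if best_match is not None and best_score >= threshold:
            merged.update(
                {k: v for k, v in best_match.items() if k != right_key}
            )
        joined.append(merged)
    return joined


# Toy tables whose keys differ by a typo, letter case, and trailing whitespace.
sales = [
    {"country": "Germany", "revenue": 120},
    {"country": "Frnace", "revenue": 95},
    {"country": "Atlantis", "revenue": 1},
]
population = [
    {"name": "germany ", "population_m": 84},
    {"name": "France", "population_m": 68},
]

result = fuzzy_left_join(sales, population, "country", "name")
for row in result:
    print(row)
```

In this sketch, rows without a sufficiently similar counterpart keep only their original columns, as in an ordinary left join. Comparing every pair of keys is quadratic in the table sizes, so this approach only suits small examples.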

Another issue data scientists frequently face is handling heterogeneous data types (e.g., dates, categorical, numerical). We will present the TableVectorizer, a preprocessor that automatically applies a suitable encoding or transformation to each type of column, reducing the amount of boilerplate code needed to build predictive models with scikit-learn. Like the joiners, this transformer is fully compatible with scikit-learn.
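The idea of choosing an encoding per column type can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the TableVectorizer's actual behavior or API: the function `vectorize_table`, the type-detection rules, and the chosen encodings (date parts, one-hot categories, numeric passthrough) are assumptions.

```python
from datetime import date


def vectorize_table(rows):
    """Turn a list of dicts with mixed types into (feature_names, matrix).

    Hypothetical helper for illustration: dates become year/month/day,
    numbers pass through as floats, everything else is one-hot encoded.
    """
    columns = list(rows[0].keys())
    feature_names = []
    encoders = []  # one function per column: value -> list of floats
    for col in columns:
        values = [row[col] for row in rows]
        if all(isinstance(v, date) for v in values):
            feature_names += [f"{col}_year", f"{col}_month", f"{col}_day"]
            encoders.append(
                lambda v: [float(v.year), float(v.month), float(v.day)]
            )
        elif all(isinstance(v, (int, float)) for v in values):
            feature_names.append(col)
            encoders.append(lambda v: [float(v)])
        else:
            categories = sorted({str(v) for v in values})
            feature_names += [f"{col}_{c}" for c in categories]
            # Bind `categories` as a default so each column keeps its own list.
            encoders.append(
                lambda v, cats=categories: [
                    1.0 if str(v) == c else 0.0 for c in cats
                ]
            )
    matrix = [
        [x for col, enc in zip(columns, encoders) for x in enc(row[col])]
        for row in rows
    ]
    return feature_names, matrix


employees = [
    {"hired": date(2020, 3, 1), "department": "sales", "salary": 52000},
    {"hired": date(2018, 11, 15), "department": "research", "salary": 61000.5},
]

feature_names, matrix = vectorize_table(employees)
print(feature_names)
for features in matrix:
    print(features)
```

The result is a purely numeric matrix that a learning algorithm can consume, produced without writing a separate encoding step for each column by hand.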


Abstract as a tweet:

Skrub: prepping tables for your machine learning pipeline

Category [Machine and Deep Learning]:

Supervised Learning

Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Public link to supporting material:

https://github.com/skrub-data/skrub

Project Homepage / Git:

https://skrub-data.org/stable/

I'm an open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.

I am a software engineer at INRIA, working mostly on the Skrub open-source Python library (https://skrub-data.org/).