Vincent Maladiere EuroSciPy 2024


Session

Skrub: prepping tables for machine learning

Guillaume Lemaitre, Vincent Maladiere, Jérôme Dockès

When designing predictive machine learning models, data scientists reportedly spend over 80% of their time preparing the data that is fed to the learning algorithm.

Currently, no fully automated solution exists to address this problem. However, the skrub Python library alleviates some of the daily tasks of data scientists and integrates with the scikit-learn machine learning library.

In this talk, we provide an overview of the features available in skrub.

First, we focus on the preprocessing stage closest to the data sources. While predictive models usually expect a single design matrix and a target vector (or matrix), in practice the data are often spread across several tables. The entries to be matched may also differ slightly from one table to another, which makes joining them difficult. We will present the skrub joiners, which handle such use cases and are fully compatible with scikit-learn and its pipeline.

Another common issue for data scientists is handling heterogeneous data types (e.g., dates, categorical, numerical). We will present the TableVectorizer, a preprocessor that automatically applies a suitable encoding or transformation to each column, reducing the amount of boilerplate code needed to design predictive models with scikit-learn. Like the joiners, this transformer is fully compatible with scikit-learn.

Machine and Deep Learning

Room 7
