2024-08-28, Room 7
When it comes to designing machine learning predictive models, it is reported that data scientists spend over 80% of their time preparing the data to input to the machine learning algorithm.
Currently, no automated solution exists to address this problem. However, the skrub Python library is here to alleviate some of the daily tasks of data scientists and offer an integration with the scikit-learn machine learning library.
In this talk, we provide an overview of the features available in skrub.
First, we focus on the preprocessing stage closest to the data sources. While predictive models usually expect a single design matrix and a target vector (or matrix), in practice, it is common that data are available from different data tables. It is also possible that the data to be merged are slightly different, making it difficult to join them. We will present the skrub joiners that handle such use cases and are fully compatible with scikit-learn and its pipeline.
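To make the joining problem concrete, here is a minimal sketch of an approximate (fuzzy) left join written with only the Python standard library. It is not skrub's implementation or API: the function name `fuzzy_left_join`, the `min_ratio` threshold, and the toy tables are assumptions made for illustration, and the similarity measure is `difflib.SequenceMatcher` rather than whatever skrub uses internally.

```python
from difflib import SequenceMatcher


def fuzzy_left_join(left, right, left_key, right_key, min_ratio=0.6):
    """Left-join two lists of dicts, matching keys by string similarity.

    Hypothetical helper for illustration only; not part of skrub.
    """
    joined = []
    for row in left:
        # Score every row of the right table against this row's key.
        best, best_ratio = None, 0.0
        for candidate in right:
            ratio = SequenceMatcher(
                None,
                str(row[left_key]).lower(),
                str(candidate[right_key]).lower(),
            ).ratio()
            if ratio > best_ratio:
                best, best_ratio = candidate, ratio
        # Keep the best match only if it is similar enough; otherwise the
        # left row stays unmatched, as in an ordinary left join.
        match = best if best_ratio >= min_ratio else {}
        joined.append({**row, **match})
    return joined


# Toy tables whose keys differ slightly (case, whitespace, a typo).
countries = [
    {"country": "France", "gdp_rank": 7},
    {"country": "Germany", "gdp_rank": 3},
]
capitals = [
    {"name": "france ", "capital": "Paris"},
    {"name": "Germnay", "capital": "Berlin"},
]

result = fuzzy_left_join(countries, capitals, "country", "name")
for row in result:
    print(row["country"], "->", row["capital"])
```

An exact join on these keys would match nothing, which is the situation the joiners are meant to handle. The quadratic scan above is only reasonable for small tables.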
Then, another issue widely tackled by data scientists is dealing with heterogeneous data types (e.g., dates, categorical, numerical). We will present the TableVectorizer, a preprocessor that automatically handles different types of encoding and transformation, reducing the amount of boilerplate code to write when designing predictive models with scikit-learn. Like the joiner, this transformer is fully compatible with scikit-learn.
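As an illustration of the kind of boilerplate involved, the sketch below dispatches on column type by hand using only the standard library: dates are split into numeric parts, strings are one-hot encoded, and numbers pass through. It is a simplified stand-in, not the TableVectorizer's actual behavior; the function `vectorize_rows`, the feature naming scheme, and the sample rows are all assumptions.

```python
from datetime import date


def vectorize_rows(rows):
    """Turn a list of dicts with mixed column types into numeric features.

    Hypothetical helper for illustration only; not part of skrub.
    Column types are inferred from the first row.
    """
    columns = list(rows[0])
    # Collect the distinct values of each string column for one-hot encoding.
    categories = {
        col: sorted({row[col] for row in rows})
        for col in columns
        if isinstance(rows[0][col], str)
    }
    encoded = []
    for row in rows:
        features = {}
        for col in columns:
            value = row[col]
            if isinstance(value, date):
                # Dates become separate numeric components.
                features[f"{col}_year"] = value.year
                features[f"{col}_month"] = value.month
                features[f"{col}_day"] = value.day
            elif isinstance(value, str):
                # Categorical strings become one indicator per category.
                for cat in categories[col]:
                    features[f"{col}_{cat}"] = 1.0 if value == cat else 0.0
            else:
                # Numeric values pass through unchanged.
                features[col] = float(value)
        encoded.append(features)
    return encoded


rows = [
    {"hired": date(2020, 3, 1), "department": "police", "salary": 52000},
    {"hired": date(2018, 11, 15), "department": "fire", "salary": 48000},
]
features = vectorize_rows(rows)
print(features[0])
```

Every new column type or encoding choice adds another branch to code like this, which is the per-column handling a single preprocessor is meant to take over.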
Skrub: prepping tables for your machine learning pipeline
Category [Machine and Deep Learning] – Supervised Learning
Expected audience expertise: Domain – some
Expected audience expertise: Python – some
Public link to supporting material – Project Homepage / Git –
I'm an open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.
I am a software engineer at INRIA, working mostly on the Skrub open-source Python library (https://skrub-data.org/).