Guillaume Lemaitre
Guillaume is an open-source software engineer working at :probabl. He is a core maintainer of the scikit-learn and imbalanced-learn libraries.
Sessions
Skrub is a package that makes it easier to prepare dataframes for machine-learning tasks. In practice, data can be spread over multiple tables, represent various types of information (tabular, textual, graphical), or be stored in external database systems rather than in dataframes.
Skrub Data Ops help construct versatile pipelines that handle this variety of scenarios while avoiding data leakage. They also make it possible to build rich hyper-parameter grids that can be explored to maximize the performance of the final machine-learning model.
In this talk, we give a brief introduction to the Data Ops framework before presenting three separate use cases that highlight its versatility: a traditional machine-learning pipeline that uses Optuna to perform hyper-parameter tuning, a pipeline that trains on data stored in a relational database rather than a dataframe, and an image classification task with PyTorch.
By the end of the talk, attendees will know what the skrub Data Ops are, what their main features are, and how to use them successfully in different practical scenarios.
Class imbalance is a common challenge in real-world machine learning. This course explores why standard approaches fail and how to build reliable classifiers using scikit-learn's calibration and threshold-tuning tools.
We cover practical solutions including resampling strategies, probabilistic calibration with CalibratedClassifierCV, and decision threshold optimization using TunedThresholdClassifierCV. You'll learn to evaluate models appropriately with calibration curves and confusion matrices.
The course also addresses prevalence shift, that is, the situation where your training data doesn't reflect the target population. We demonstrate weight-based training corrections and post-hoc probability adjustments applicable to any binary classifier.
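One common form of post-hoc probability adjustment rescales the predicted odds by the ratio of target to training prevalence. The sketch below shows that standard prior-shift correction in plain Python; it is not necessarily the exact method presented in the course. The function name `adjust_for_prevalence` is hypothetical. The correction assumes the classifier's probabilities are calibrated on the training distribution and that only the class prevalence changes between training and deployment.

```python
def adjust_for_prevalence(p, train_prevalence, target_prevalence):
    """Re-weight a calibrated P(y=1 | x) for a different class prevalence.

    Hypothetical helper: assumes only the class prior differs between the
    training data and the target population (prior/prevalence shift).
    """
    # Scale the positive-class mass by the ratio of target to training prior.
    pos = p * target_prevalence / train_prevalence
    # Scale the negative-class mass by the ratio of the complementary priors.
    neg = (1.0 - p) * (1.0 - target_prevalence) / (1.0 - train_prevalence)
    # Renormalize so the two adjusted masses sum to one.
    return pos / (pos + neg)


# A model trained on balanced data (50% positives), deployed where only
# 10% of cases are positive: an uninformative 0.5 score maps to the new prior.
adjusted = adjust_for_prevalence(0.5, train_prevalence=0.5, target_prevalence=0.1)
print(adjusted)
```

Because the adjustment only needs predicted probabilities and the two prevalences, it can be applied on top of any binary classifier without retraining.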