EuroSciPy 2024

Guillaume Lemaitre

I'm an open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.


Institute / Company

:probabl.

Homepage

https://glemaitre.github.io/

Twitter handle

@glemaitre58

Git*hub|lab

https://github.com/glemaitre


Sessions

08-26
14:00
90min
Probabilistic classification and cost-sensitive learning with scikit-learn
Guillaume Lemaitre, Olivier Grisel

Data scientists are repeatedly told that it is absolutely critical to align their model training methodology with a specific business objective. While this is good advice, it usually falls short on details about how to achieve it in practice.

This hands-on tutorial introduces helpful theoretical concepts and concrete software tools to help data scientists bridge this gap. The approach will be illustrated on a worked practical use case: optimizing the operations of a fraud detection system for a payment processing platform.

More specifically, we will introduce the concepts of calibrated probabilistic classifiers, how to evaluate them and fix common causes of mis-calibration. In a second part, we will explore how to turn probabilistic classifiers into optimal business decision makers.
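As a rough illustration of the second part, the sketch below shows one common way to turn a calibrated fraud probability into a decision: flag a transaction when the expected cost of missing a fraud exceeds the expected cost of a false alarm. It is not taken from the tutorial material; the function names and the cost values are hypothetical, and it assumes that correct decisions cost nothing and that the probabilities are well calibrated.

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability above which flagging minimizes the expected cost.

    Flagging costs (1 - p) * cost_false_positive on average, and not
    flagging costs p * cost_false_negative. Flagging is cheaper when
    p > cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(fraud_probabilities, cost_false_positive, cost_false_negative):
    """Return True for each transaction that should be flagged."""
    threshold = decision_threshold(cost_false_positive, cost_false_negative)
    return [p > threshold for p in fraud_probabilities]


# Hypothetical costs: a missed fraud is nine times as costly as a false alarm.
probabilities = [0.02, 0.15, 0.60]
decisions = decide(probabilities, cost_false_positive=1.0, cost_false_negative=9.0)
print(decisions)
```

With these hypothetical costs the threshold sits well below 0.5, which is why calibration matters: the rule is only meaningful if a predicted probability of 0.15 really corresponds to roughly 15% of such transactions being fraudulent.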

The tutorial material is available at the following URL: https://github.com/probabl-ai/calibration-cost-sensitive-learning

Machine and Deep Learning
Room 5
08-28
13:20
30min
Skrub: prepping tables for machine learning
Guillaume Lemaitre, Vincent Maladiere, Jérôme Dockès

When it comes to designing machine learning predictive models, it is reported that data scientists spend over 80% of their time preparing the data to input to the machine learning algorithm.

Currently, no automated solution exists to address this problem. However, the skrub Python library is here to alleviate some of the daily tasks of data scientists and offer an integration with the scikit-learn machine learning library.

In this talk, we provide an overview of the features available in skrub.

First, we focus on the preprocessing stage closest to the data sources. While predictive models usually expect a single design matrix and a target vector (or matrix), in practice, it is common that data are available from different data tables. It is also possible that the data to be merged are slightly different, making it difficult to join them. We will present the skrub joiners that handle such use cases and are fully compatible with scikit-learn and its pipeline.
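To make the problem of joining on slightly different keys concrete, here is a minimal sketch that uses only the Python standard library. It is not skrub's API or implementation; the `fuzzy_left_join` helper, the tables, and the 0.6 similarity cutoff are all hypothetical and serve only to illustrate the idea of approximate key matching.

```python
import difflib


def fuzzy_left_join(main_rows, aux_rows, key, cutoff=0.6):
    """Left-join two lists of dicts on approximately matching string keys.

    Each row of main_rows is paired with the aux row whose key is most
    similar, as long as the similarity reaches the cutoff. Rows without
    a close enough match get None for the auxiliary columns.
    """
    aux_by_key = {row[key]: row for row in aux_rows}
    aux_columns = [col for col in aux_rows[0] if col != key]
    joined = []
    for row in main_rows:
        matches = difflib.get_close_matches(
            row[key], list(aux_by_key), n=1, cutoff=cutoff
        )
        match = aux_by_key[matches[0]] if matches else None
        extra = {col: (match[col] if match else None) for col in aux_columns}
        joined.append({**row, **extra})
    return joined


# Hypothetical tables: the main table contains misspelled country names.
main = [
    {"country": "France", "sales": 10},
    {"country": "Germny", "sales": 20},
    {"country": "Spain", "sales": 30},
]
aux = [
    {"country": "France", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
    {"country": "Italy", "capital": "Rome"},
]
joined = fuzzy_left_join(main, aux, key="country")
for row in joined:
    print(row)
```

The cutoff controls the trade-off between missing valid matches and pairing unrelated rows, so in practice it would need to be checked against the data at hand.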

Another issue data scientists frequently face is dealing with heterogeneous data types (e.g., dates, categorical, numerical). We will present the TableVectorizer, a preprocessor that automatically handles different types of encoding and transformation, reducing the amount of boilerplate code to write when designing predictive models with scikit-learn. Like the joiners, this transformer is fully compatible with scikit-learn.

Machine and Deep Learning
Room 7
08-29
13:20
100min
Dispatching, Backend Selection, and Compatibility APIs
Guillaume Lemaitre, Joris Van den Bossche, Tim Head, Erik Welch, Marco Gorelli, Sebastian Berg, Aditi Juneja, Stéfan van der Walt

Scientific Python libraries struggle with the existence of several array and dataframe providers. Many important libraries currently support mainly NumPy arrays or pandas dataframes. However, as library authors, we wish to allow users to smoothly use other array providers and, for example, simplify the use of GPUs without the need for explicit use of CUDA-enabled libraries.

This session will be split into three related discussions around efforts to tackle this situation:
* Dispatching and backend selection discussion
* Array API adoption progress and discussion
* Dataframe compatibility layer discussion

High Performance Computing
Room 5