EuroSciPy 2025

Guillaume Lemaitre

I'm chief machine learning officer and open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.


Photo euroscipy-2025/question_uploads/picture_xnYaJSR.jpg Affiliation

Probabl

Position / Job

Chief machine learning officer & open source software engineer

Homepage

https://glemaitre.github.io/

GitHub/GitLab profile URL

https://github.com/glemaitre

LinkedIn

https://www.linkedin.com/in/guillaume-lemaitre-b9404939/


Sessions

08-18
10:30
90min
Skrub: machine learning for dataframes
Guillaume Lemaitre, Jérôme Dockès, Riccardo Cappuzzo

Machine-learning algorithms expect a numeric array with one row per observation. Typically, creating this table requires "wrangling" with Pandas or Polars (aggregations, selections, joins, ...), and to extract numeric features from structured data types such as datetimes. These transformations must be applied consistently when making predictions for unseen inputs, and choices must be informed by performance measured on a validation dataset, while preventing data leakage. This preprocessing is the most difficult and time-consuming part of many data-science projects.

Skrub bridges the gap between complex tabular data stored in Pandas or Polars dataframes, and machine-learning algorithms implemented by scikit-learn estimators. It provides scikit-learn transformers to extract features from datetimes, (fuzzy) categories and text, and to perform data-wrangling such as joins and aggregations in a learning pipeline. Its pre-built, flexible learners offer very robust performance on many tabular datasets without manual tweaking. It can create complex pipelines that handle multiple tables, while easily describing and searching rich hyperparameter spaces. As interactivity and visualization are essential for preprocessing, Skrub also provides an interactive report to explore a dataframe, and its pipelines can be built incrementally while inspecting intermediate results.

We will give an overview of Skrub and demonstrate its features on realistic and challenging tabular learning scenarios

Computational Tools and Scientific Python Infrastructure
Large Room
08-19
08:30
90min
Predictive Modeling with Imbalanced Datasets Using Scikit-learn
Guillaume Lemaitre, Olivier Grisel

Real-world applications use machine learning to aid decision-making and planning. Data scientists employ probabilistic models to connect input data with outcome predictions that guide operational decisions. A common challenge is working with "imbalanced" datasets, where the outcome of interest occurs rarely compared to total observations. Examples include disease detection in medical screening, fraud identification in transactions, and discovery of rare physical phenomena like the Higgs boson.

This tutorial examines methodological considerations for handling imbalanced datasets. We focus on resampling techniques that adjust the ratio between positive and negative outcomes. The tutorial explores: (i) how imbalanced data affects probability outcomes and classifier calibration; (ii) resampling's impact on model overfitting/underfitting and its connection to regularization; and (iii) the tradeoffs between computational and statistical performance when implementing resampling strategies.

Hands-on programmatic notebooks provide practical insights into these concepts.

Applied AI & LLM Technologies and Use Cases
Large Room