Olivier Grisel EuroSciPy 2025

Olivier Grisel

Olivier Grisel is a machine learning engineer at Probabl and a contributor to the scikit-learn library.

Session

Predictive Modeling with Imbalanced Datasets Using Scikit-learn

Real-world applications use machine learning to aid decision-making and planning. Data scientists employ probabilistic models to connect input data with outcome predictions that guide operational decisions. A common challenge is working with "imbalanced" datasets, where the outcome of interest occurs rarely relative to the total number of observations. Examples include disease detection in medical screening, fraud identification in transactions, and discovery of rare physical phenomena like the Higgs boson.

This tutorial examines methodological considerations for handling imbalanced datasets. We focus on resampling techniques that adjust the ratio between positive and negative outcomes. The tutorial explores: (i) how imbalanced data affects probability outcomes and classifier calibration; (ii) resampling's impact on model overfitting/underfitting and its connection to regularization; and (iii) the tradeoffs between computational and statistical performance when implementing resampling strategies.
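As an illustration of point (i), here is a minimal sketch, not taken from the tutorial notebooks, of how undersampling the majority class can shift predicted probabilities away from the true prevalence. The synthetic dataset, the 5% positive rate, and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% positives (an assumed ratio).
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Undersample the negative class so both classes have equal counts.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_sub = rng.choice(neg_idx, size=pos_idx.size, replace=False)
balanced_idx = np.concatenate([pos_idx, neg_sub])

# Same model class, trained on the original and on the resampled data.
full_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
resampled_model = LogisticRegression(max_iter=1000).fit(
    X_train[balanced_idx], y_train[balanced_idx]
)

# Compare the average predicted probability with the observed prevalence
# on the untouched (still imbalanced) test set.
prevalence = y_test.mean()
mean_full = full_model.predict_proba(X_test)[:, 1].mean()
mean_resampled = resampled_model.predict_proba(X_test)[:, 1].mean()

print(f"test prevalence:              {prevalence:.3f}")
print(f"mean proba, original data:    {mean_full:.3f}")
print(f"mean proba, undersampled data: {mean_resampled:.3f}")
```

In this setup the model trained on the original data should produce an average probability close to the test prevalence, while the model trained on the balanced subsample should overestimate it. Its probabilities would need recalibration before being used for decisions.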

Hands-on notebooks provide practical insights into these concepts.

Applied AI & LLM Technologies and Use Cases

Room 1.38 (Ground Floor)
