EuroSciPy 2025

Olivier Grisel

Olivier Grisel is a machine learning engineer at Probabl and a contributor to the scikit-learn library.


Session

08-19
08:30
90min
Predictive modeling for imbalanced classification using scikit-learn
Guillaume Lemaitre, Olivier Grisel

Real-world applications use machine learning to aid decision-making and planning. Data scientists employ probabilistic models to connect input data with outcome predictions that guide operational decisions. A common challenge is working with "imbalanced" datasets, where the outcome of interest occurs rarely relative to the total number of observations. Examples include disease detection in medical screening, fraud identification in transactions, and discovery of rare physical phenomena like the Higgs boson.

This tutorial examines methodological considerations for handling imbalanced datasets. We focus on resampling techniques that adjust the ratio between positive and negative outcomes. The tutorial explores: (i) how imbalanced data affects predicted probabilities and classifier calibration; (ii) how resampling affects overfitting and underfitting, and how it relates to regularization; and (iii) the tradeoffs between computational and statistical performance when implementing resampling strategies.
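As an illustration of point (i), here is a minimal sketch that is not taken from the tutorial material. It assumes the negative class is randomly undersampled, keeping a fraction `beta` of the negative samples, and that the model trained on the resampled data is well calibrated on that resampled data. Under those assumptions, undersampling multiplies the positive-class odds by 1 / beta, so a predicted probability can be mapped back to the original class prior. The names `corrected_probability` and `beta` are hypothetical and chosen only for this example.

```python
def corrected_probability(p_resampled: float, beta: float) -> float:
    """Map a probability predicted after negative-class undersampling
    back to the original class prior.

    p_resampled: positive-class probability from a model trained on
        undersampled data (assumed calibrated on that data).
    beta: fraction of negative samples kept, with 0 < beta <= 1.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    if not 0.0 <= p_resampled <= 1.0:
        raise ValueError("p_resampled must be in [0, 1]")
    # Undersampling divides the number of negatives by 1 / beta, which
    # multiplies the positive-class odds by 1 / beta. Multiplying the
    # odds by beta undoes this shift; the expression below is the
    # resulting probability written without an explicit odds term.
    return beta * p_resampled / (beta * p_resampled + 1.0 - p_resampled)


# Hypothetical dataset: 10 positives and 990 negatives (1% prevalence).
# Keeping 10 of the 990 negatives yields a balanced training set.
beta = 10 / 990
# On the balanced set, an uninformative model would predict 0.5;
# the correction maps this back to the original prevalence.
print(corrected_probability(0.5, beta))
```

With these numbers the corrected value is approximately 0.01, the original prevalence. With `beta = 1` (no undersampling) the function leaves the probability unchanged.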

Hands-on notebooks provide practical insight into these concepts.

The material and instructions to follow the tutorial will be available here:
https://github.com/probabl-ai/calibration-cost-sensitive-learning

Applied AI & LLM Technologies and Use Cases
Room 1.38 (Ground Floor)