Probabilistic classification and cost-sensitive learning with scikit-learn EuroSciPy 2024


2024-08-26 14:00–15:30, Room 5

Data scientists are repeatedly told that it is absolutely critical to align their model training methodology with a specific business objective. While this is sound advice, it usually comes with few details on how to achieve it in practice.

This hands-on tutorial introduces theoretical concepts and concrete software tools to help them bridge this gap. The approach is illustrated on a worked practical use case: optimizing the operations of a fraud detection system for a payment processing platform.

More specifically, we will introduce the concept of calibrated probabilistic classifiers, show how to evaluate them, and fix common causes of miscalibration. In the second part, we will explore how to turn probabilistic classifiers into optimal business decision makers.

The tutorial material is available at the following URL: https://github.com/probabl-ai/calibration-cost-sensitive-learning

Detailed outline of the tutorial:

Introduction
- Evaluating ML-based predictions with:
  - ranking metrics,
  - probabilistic metrics,
  - decision metrics.
- Proper scoring losses and their decomposition into:
  - calibration loss,
  - grouping loss,
  - irreducible loss.
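
As an illustration of the three metric families listed above, here is a minimal sketch that is not taken from the tutorial notebooks. The synthetic dataset, the logistic regression model, and the choice of ROC AUC, log loss, Brier score, and accuracy as representatives of each family are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    log_loss,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (a stand-in for real fraud data).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(y = 1 | x)
pred = model.predict(X_test)  # hard decisions at the default 0.5 threshold

# Ranking metric: only depends on the ordering of the scores.
ranking_auc = roc_auc_score(y_test, proba)

# Probabilistic metrics: proper scoring rules on the predicted probabilities.
prob_log_loss = log_loss(y_test, proba)
prob_brier = brier_score_loss(y_test, proba)

# Decision metric: evaluates the thresholded predictions.
decision_accuracy = accuracy_score(y_test, pred)

print(f"ROC AUC:  {ranking_auc:.3f}")
print(f"Log loss: {prob_log_loss:.3f}")
print(f"Brier:    {prob_brier:.3f}")
print(f"Accuracy: {decision_accuracy:.3f}")
```

A ranking metric such as ROC AUC is unchanged by any strictly monotonic transformation of the scores, so it cannot detect miscalibration, whereas the log loss and the Brier score can.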
Part I: Probabilistic classification
- The calibration curve
- Possible causes of miscalibration
  - Model misspecification
  - Overfitting and poorly tuned regularization
- Possible ways to improve calibration
  - Non-linear feature engineering to avoid misspecification
  - Post-hoc calibration with isotonic regression
  - Tuning parameters and early stopping with a proper scoring rule
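
The calibration curve and post-hoc isotonic calibration from Part I could be sketched as follows. This is an illustrative example under stated assumptions, not the tutorial's own code: the synthetic dataset and the choice of Gaussian naive Bayes as the base model are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features (an assumption for the example).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=5,
    n_redundant=10,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "raw": GaussianNB(),
    # Post-hoc calibration: an isotonic regression is fitted on
    # cross-validated predictions of the base model.
    "isotonic": CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5),
}

curves, losses = {}, {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    # Observed fraction of positives vs. mean predicted probability per bin;
    # a well-calibrated model lies close to the diagonal.
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
    curves[name] = (prob_true, prob_pred)
    losses[name] = log_loss(y_test, proba)  # a proper scoring rule

for name in models:
    print(f"{name}: log loss = {losses[name]:.3f}")
```

Plotting `prob_pred` against `prob_true` for each model gives the calibration curve; `sklearn.calibration.CalibrationDisplay` offers a ready-made plot for the same purpose.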
Part II: Optimal decision making under uncertainty
- Defining custom business cost functions
- Individual-specific cost functions
- Setting the Elkan-optimal threshold with FixedThresholdClassifier
- Cost-sensitive learning for arbitrary cost functions with TunedThresholdClassifierCV
- Predict-time decision threshold optimization.

This tutorial will be delivered as a set of publicly available Jupyter notebooks under an open source license.

We will mostly use components of the latest version of the scikit-learn library, plus a few custom extensions.


Abstract as a tweet:

Probabilistic classification and cost-sensitive learning with scikit-learn. Learn the power of hparam tuning with proper scoring rules and optimal decision threshold tuning on custom business rules.

Category [Machine and Deep Learning]:

Supervised Learning

Expected audience expertise: Domain:

expert

Expected audience expertise: Python:

some

Public link to supporting material:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cost_sensitive_learning.html

Project Homepage / Git:

https://scikit-learn.org

Guillaume Lemaitre

I'm an open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.


Olivier Grisel

Olivier is a software engineer at Probabl and a core contributor to the scikit-learn open source Machine Learning library.

https://sigmoid.social/@ogrisel
