EuroSciPy 2024

Probabilistic classification and cost-sensitive learning with scikit-learn
2024-08-26 , Room 5

Data scientists are repeatedly told that it is absolutely critical to align their model training methodology with a specific business objective. While this is sound advice, it usually falls short on details on how to achieve it in practice.

This hands-on tutorial introduces helpful theoretical concepts and concrete software tools to help data scientists bridge this gap. The approach will be illustrated on a worked practical use case: optimizing the operations of a fraud detection system for a payment processing platform.

More specifically, in the first part we will introduce calibrated probabilistic classifiers, how to evaluate them, and how to fix common causes of miscalibration. In the second part, we will explore how to turn probabilistic classifiers into optimal business decision makers.
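
As a rough illustration of the first part, the sketch below compares a classifier before and after post-hoc isotonic calibration, using the calibration curve and two proper scoring rules (Brier score and log loss). It is not taken from the tutorial notebooks: the synthetic dataset, the choice of Gaussian naive Bayes as the base model, and all parameter values are assumptions made for this example.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary problem (a stand-in, not the tutorial's fraud data).
X, y = make_classification(
    n_samples=5_000,
    n_features=20,
    n_informative=5,
    n_redundant=10,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Base model fitted as-is (assumed here to be a plausible miscalibrated model).
raw = GaussianNB().fit(X_train, y_train)

# Same model wrapped with post-hoc isotonic calibration, fitted by 5-fold CV.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

scores = {}
for name, model in [("raw", raw), ("isotonic", calibrated)]:
    proba = model.predict_proba(X_test)[:, 1]
    # Calibration curve points: observed positive rate vs. mean predicted
    # probability in each bin. A well-calibrated model stays near the diagonal.
    prob_true, prob_pred = calibration_curve(
        y_test, proba, n_bins=10, strategy="quantile"
    )
    scores[name] = {
        "brier": brier_score_loss(y_test, proba),
        "log_loss": log_loss(y_test, proba),
    }
    print(name, scores[name])
```

Lower values are better for both scores. Plotting prob_pred against prob_true for each model gives the calibration curve itself.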

The tutorial material is available at the following URL: https://github.com/probabl-ai/calibration-cost-sensitive-learning


Detailed outline of the tutorial:

  • Introduction
    • Evaluating ML-based predictions with:
      • ranking metrics,
      • probabilistic metrics,
      • decision metrics.
    • Proper scoring losses and their decomposition into:
      • calibration loss,
      • grouping loss,
      • irreducible loss.
  • Part I: Probabilistic classification
    • The calibration curve
    • Possible causes of miscalibration
      • Model misspecification
      • Overfitting and poorly tuned regularization
    • Possible ways to improve calibration
      • Non-linear feature engineering to avoid misspecification
      • Post-hoc calibration with isotonic regression
      • Tuning parameters and early stopping with a proper scoring rule
  • Part II: Optimal decision making under uncertainty
    • Defining custom business cost functions
    • Individual-specific cost functions
    • Setting the Elkan-optimal threshold with FixedThresholdClassifier
    • Cost-sensitive learning for arbitrary cost functions with TunedThresholdClassifierCV
    • Predict-time decision threshold optimization

This tutorial will be delivered as a set of publicly available Jupyter notebooks under an open source license.

We will mostly use components of the latest version of the scikit-learn library, plus a few custom extensions.


Abstract as a tweet:

Probabilistic classification and cost-sensitive learning with scikit-learn. Learn the power of hparam tuning with proper scoring rules and optimal decision threshold tuning on custom business rules.

Category [Machine and Deep Learning]:

Supervised Learning

Expected audience expertise: Domain:

expert

Expected audience expertise: Python:

some

Public link to supporting material:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cost_sensitive_learning.html

Project Homepage / Git:

https://scikit-learn.org

I'm an open source software engineer at :probabl. I'm a core developer of scikit-learn and imbalanced-learn.

Olivier is a software engineer at Probabl and a core contributor to the scikit-learn open source Machine Learning library.

https://sigmoid.social/@ogrisel