Predictive survival analysis with scikit-learn, scikit-survival and lifelines
2023-08-14, Aula

This tutorial shows how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing bias from censored training (and evaluation) data.


Tutorial notebooks:

According to Wikipedia:

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as deaths in biological organisms and failure in mechanical systems. [...]. Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

In this tutorial we will dive deep into a practical case study of predictive maintenance using tools from the scientific Python ecosystem. Here is a tentative agenda:

  • What time-censored data is and why it is a problem when training time-to-event regression models.
  • Single-event survival analysis with the Kaplan-Meier estimator using scikit-survival.
  • Evaluating the calibration of survival analysis estimators using the integrated Brier score (IBS) metric.
  • Predictive survival analysis modeling with Cox Proportional Hazards and Survival Forests using scikit-survival, and with GradientBoostedIBS implemented from scratch with scikit-learn.
  • How to use a trained GradientBoostedIBS model to estimate the median survival time and the probability of survival at a fixed time horizon.
  • Inspecting the learned statistical association between input features and survival probabilities using partial dependence plots.
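To make the first agenda items concrete, here is a minimal sketch of the Kaplan-Meier estimator and of a median survival time read off its curve. It is written in plain Python for illustration only: it is not the tutorial's code and does not use the scikit-survival API. The function names (`kaplan_meier`, `median_survival_time`) and the toy data are assumptions made up for this example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: time to the event or to censoring, one value per individual.
    events: True if the event was observed, False if the record is censored.
    """
    n_at_risk = len(durations)
    observed = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if observed[t] > 0:
            # Multiply by the fraction of at-risk individuals surviving time t.
            survival *= 1.0 - observed[t] / n_at_risk
            curve.append((t, survival))
        n_at_risk -= exits[t]
    return curve


def median_survival_time(curve):
    """Smallest time at which the estimated survival is 0.5 or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5, so the median is not identified


# Hypothetical toy data: six machines, two of them censored
# (still running when the observation window closed).
durations = [2, 3, 3, 5, 8, 8]
events = [True, True, False, True, False, True]

km_curve = kaplan_meier(durations, events)
km_median = median_survival_time(km_curve)
print(km_curve)
print("Kaplan-Meier median survival time:", km_median)
```

Censored individuals still count in the risk set until they drop out, which is how the estimator uses their partial information instead of discarding them or treating their censoring time as a failure time.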

The tutorial notebooks also contain additional material that we probably won't have time to present in 90 min, namely:

  • Competing risks modeling with the Nelson-Aalen and Aalen-Johansen estimators using lifelines.
  • Estimation of the cause-specific cumulative incidence function (CIF) using our GradientBoostedIBS model.
  • Extracting implicit failure data from operation logs using sessionization with Ibis and DuckDB.
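For the competing risks items above, the following is a rough sketch of a non-parametric Aalen-Johansen estimate of the cause-specific cumulative incidence function (CIF). It is plain Python for illustration and uses neither lifelines nor the GradientBoostedIBS model from the notebooks. The function name `aalen_johansen`, the encoding of causes (0 for censored, positive integers for event types) and the toy data are all assumptions of this example.

```python
from collections import Counter


def aalen_johansen(durations, causes, cause_of_interest):
    """Return [(t, CIF(t))] for one cause at each distinct event time.

    causes: 0 means censored, any positive integer identifies an event type.
    """
    n_at_risk = len(durations)
    exits = Counter(durations)
    any_event = Counter(t for t, c in zip(durations, causes) if c != 0)
    this_cause = Counter(
        t for t, c in zip(durations, causes) if c == cause_of_interest
    )
    survival = 1.0  # any-event Kaplan-Meier survival just before time t
    cif = 0.0
    curve = []
    for t in sorted(exits):
        if any_event[t] > 0:
            # Probability of being event-free so far, then failing from
            # the cause of interest at time t.
            cif += survival * this_cause[t] / n_at_risk
            survival *= 1.0 - any_event[t] / n_at_risk
            curve.append((t, cif))
        n_at_risk -= exits[t]
    return curve


# Hypothetical toy data: two failure types (1 and 2) and one censored record.
durations = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 0, 1, 2, 1]

cif_1 = aalen_johansen(durations, causes, cause_of_interest=1)
cif_2 = aalen_johansen(durations, causes, cause_of_interest=2)
print("CIF for cause 1:", cif_1)
print("CIF for cause 2:", cif_2)
```

At any time, the cause-specific incidences and the any-event survival probability sum to one, which is a quick sanity check on an implementation like this.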

Target audience: good familiarity with machine learning concepts and prior experience using scikit-learn (you know what cross-validation means and how to fit a Random Forest on a pandas DataFrame).


Category [Machine and Deep Learning]: Supervised Learning

Expected audience expertise: Domain: expert

Expected audience expertise: Python: some

Public link to supporting material

https://vincent-maladiere.github.io/survival-analysis-demo

Abstract as a tweet

Survival Analysis with scikit-learn, scikit-survival and lifelines

Machine Learning software engineer at Inria and member of the maintainers' team of the scikit-learn open source project.

Machine Learning Engineer at Inria • Contributor to scikit-learn, skrub and hazardous • Eager to talk about deploying stuff and MLOps :)