Machine learning with missing values
2022-09-01 , HS 118

This talk will cover how to build predictive models that handle missing values well, using scikit-learn. It will present, on the one hand, the statistical considerations, covering both the classic statistical missing-values theory and recent developments in machine learning, and, on the other hand, how to code solutions efficiently.


In many data-science applications, the data come with missing values. There is a rich statistical literature on performing analysis with missing values. However, machine learning brings new tradeoffs: how should we deal with missing values at test time? Should we really care about recovering the model suited to fully observed data? I will cover both the classic theory and recent theoretical advances. I will show how scikit-learn can be used to implement various solutions, and how these illustrate the theory.
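As a minimal sketch of the test-time question, the snippet below fits a scikit-learn pipeline on data with missing entries and then predicts on test rows that also contain missing entries. The synthetic data, the 20% missingness rate, and the choice of `SimpleImputer` with `Ridge` are assumptions made for illustration, not the specific setup used in the talk.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 3 features, linear target.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Hide about 20% of the entries, independently of their values.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test = X_missing[:150], X_missing[150:]
y_train, y_test = y[:150], y[150:]

# The imputer is part of the pipeline, so the statistics learned on the
# training set (here, column means) are reused to fill in test-time gaps.
# add_indicator=True appends binary columns flagging which entries were missing.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    Ridge(alpha=1.0),
)
model.fit(X_train, y_train)

# Prediction works even though X_test itself contains NaN values.
y_pred = model.predict(X_test)
```

Keeping the imputer inside the pipeline means the same imputation is applied at train and test time, which is the practical point raised by the test-time question.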

Tentative outline:
- The classic statistical view on missing values
- Missing at Random Settings: why it is important
- Imputation, and corresponding scikit-learn tools
- Prediction with missing values
- Simple predictors need very good imputers
- Rich predictors work with simple imputers, even outside Missing at Random settings


Expected audience expertise: Python

none

Public link to supporting material

https://www.slideshare.net/GaelVaroquaux/machine-learning-with-missing-values

Abstract as a tweet

Data with missing values! What does it take to build the best machine-learning model?

Domains

Machine Learning, Statistics

Expected audience expertise: Domain

some

Gaël Varoquaux is a research director working on data science and health at Inria (the French national research institute for computer science). His research focuses on using data and machine learning for scientific inference, with applications to health and social science, as well as developing tools that make it easier for non-specialists to use machine learning. He has been building easy-to-use open-source software in Python for over 15 years. He is a core developer of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python, e.g. as a creator of the scipy lecture notes.
