Evaluating your machine learning models: beyond the basics EuroSciPy 2022


2022-08-29 13:30–15:00, HS 120

This tutorial will guide you towards sound evaluation of machine-learning models, choosing metrics and procedures that match the intended usage, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, for example errors on the positive and on the negative class, or the probability of a detection versus the probability of a true event given a detection, since they may need to cater for class imbalance. Metrics may also evaluate confidence scores, e.g. calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variations.

Model evaluation is a crucial aspect of machine learning, whether to choose the best model or to decide if a given model is good enough for production. This tutorial will give didactic introductions to the various statistical aspects of model evaluation: which aspects of model prediction are important to capture, how the different metrics available in scikit-learn capture them, and how to devise a model-evaluation procedure that is best suited to selecting the best model or to checking that a model is fit for use. This tutorial goes beyond mere application of scikit-learn, and we expect even experts to learn useful considerations.

The tutorial will be loosely based on the following preprint: https://hal.archives-ouvertes.fr/hal-03682454, with added code examples for each important concept. A tentative outline is as follows:

Performance metrics

Metrics for classification

Binary classification
- Confusion matrix
- Simple summaries and their pitfalls
- Probability of detection given true class, or vice versa?
- Summary metrics for low prevalence
- Metrics for shifts in prevalence
- Multi-threshold metrics
- Confidence scores and calibration
Multi-class classification
- Adapting binary metrics to multi-class settings
- Multilabel classification
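
As a minimal sketch of the binary-classification items above, the following compares a few scikit-learn metrics on a low-prevalence problem. The data are a hypothetical toy example (5% positives, made-up predictions), not taken from the tutorial material. The point is only that accuracy can look reasonable while the probability of a true event given a detection (precision) is poor.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical toy data: 1000 samples, 5% positives (low prevalence).
y_true = np.array([1] * 50 + [0] * 950)
# Hypothetical classifier: detects 40 of the 50 positives and raises
# 95 false alarms among the 950 negatives.
y_pred = np.array([1] * 40 + [0] * 10 + [1] * 95 + [0] * 855)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # P(detection | true event)
precision = precision_score(y_true, y_pred)  # P(true event | detection)
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of recall and specificity

# A trivial baseline that always predicts the negative class.
baseline_accuracy = accuracy_score(y_true, np.zeros_like(y_true))

print(f"accuracy={accuracy:.3f}  baseline accuracy={baseline_accuracy:.3f}")
print(f"recall={recall:.3f}  precision={precision:.3f}  balanced={balanced:.3f}")
```

In this toy setting the always-negative baseline has a higher accuracy than the classifier, which is one reason to report recall and precision, or a prevalence-aware summary, alongside accuracy.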

Metrics for regression

R2 score
Absolute error measures
Assessing the distribution of errors
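
A minimal sketch of the regression items above, assuming a handful of made-up target values and predictions (hypothetical numbers, chosen so that one prediction is far off). It contrasts the R2 score with absolute error measures and a simple look at the distribution of errors.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score

# Hypothetical targets and predictions; the last prediction is a large miss.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0, 14.0])

r2 = r2_score(y_true, y_pred)                  # 1 - SS_res / SS_tot
mae = mean_absolute_error(y_true, y_pred)      # in the units of the target
medae = median_absolute_error(y_true, y_pred)  # robust to the large miss

# Looking at the distribution of errors rather than a single summary:
abs_errors = np.abs(y_true - y_pred)
q90 = np.quantile(abs_errors, 0.9)

print(f"R2={r2:.4f}  MAE={mae:.2f}  median AE={medae:.2f}  90% quantile={q90:.2f}")
```

The gap between the mean and the median absolute error here comes entirely from the single large miss, which a lone summary number would hide.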

Evaluation strategies

Evaluating a learning procedure

Cross-validation strategies
Driving model choices: nested cross-validation
Statistical testing
- Sources of variance
- Accounting for benchmarking variance
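
A minimal sketch of nested cross-validation with scikit-learn, under stated assumptions: the synthetic dataset, the logistic-regression model, and the grid over `C` are illustrative choices, not the tutorial's own. The inner loop selects the hyperparameter, and the outer loop evaluates the whole learning procedure, selection included.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data, used purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: chooses the hyperparameter. Outer loop: evaluates the procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold refits the full search on its own training split, so the
# held-out score is not biased by the hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"ROC AUC: mean={outer_scores.mean():.3f}, std across folds={outer_scores.std():.3f}")
```

The spread across folds gives only a rough indication of the variance of the estimate, since the folds share training data and are not independent.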

Evaluating generalization to an external population

The notion of external validity
Confidence intervals for external validation
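
One common way to attach a confidence interval to a score measured on an external test set is a percentile bootstrap over the test samples. The sketch below assumes simulated labels and predictions (hypothetical, roughly 85% correct) and is not necessarily the procedure used in the tutorial.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Simulated external test set: 200 samples, predictions wrong about 15% of the time.
n_test = 200
y_true = rng.integers(0, 2, size=n_test)
wrong = rng.random(n_test) < 0.15
y_pred = np.where(wrong, 1 - y_true, y_true)

point_estimate = accuracy_score(y_true, y_pred)

# Percentile bootstrap: resample test samples with replacement and rescore.
n_bootstraps = 1000
boot_scores = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    idx = rng.integers(0, n_test, size=n_test)
    boot_scores[i] = accuracy_score(y_true[idx], y_pred[idx])

low, high = np.percentile(boot_scores, [2.5, 97.5])
print(f"accuracy={point_estimate:.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

This interval reflects only the sampling noise of the test set for a fixed model. It does not account for the variance due to retraining the model.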

Public link to supporting material:

https://hal.archives-ouvertes.fr/hal-03682454

Abstract as a tweet:

Evaluating machine-learning models beyond the basics: metrics suitable for low or varying prevalence, confidence intervals

Domains:

Machine Learning, Medicine/Health, Statistics

Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Gaël Varoquaux

Gaël Varoquaux is a research director working on data science and health at Inria (the French national research institute for computer science). His research focuses on using data and machine learning for scientific inference, with applications to health and social science, as well as developing tools that make it easier for non-specialists to use machine learning. He has been building easy-to-use open-source software in Python for over 15 years. He is a core developer of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python, e.g. as a creator of the scipy lecture notes.

Arturo Amor