Evaluating your machine learning models: beyond the basics
2022-08-29, HS 120

This tutorial will guide you towards good evaluation of machine-learning models, choosing metrics and procedures that match the intended usage, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, for instance errors on the positive and on the negative class, the probability of a detection given a true event, or the probability of a true event given a detection, and how they may need to cater for class imbalance. Metrics may also evaluate confidence scores, for instance their calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variations.


Model evaluation is a crucial aspect of machine learning, whether to choose the best model or to decide if a given model is good enough for production. This tutorial will give a didactic introduction to the various statistical aspects of model evaluation: which aspects of model prediction are important to capture, how the different metrics available in scikit-learn capture them, and how to devise a model-evaluation procedure that is best suited to select the best model or to check that a model is fit for its intended use. This tutorial goes beyond the mere application of scikit-learn, and we expect even experts to take away useful considerations.

The tutorial will be loosely based on the following preprint, https://hal.archives-ouvertes.fr/hal-03682454, complemented with code examples for each important concept. A tentative outline is as follows:

Performance metrics

Metrics for classification

  • Binary classification (code sketch below)
    • Confusion matrix
    • Simple summaries and their pitfalls
    • Probability of detection given true class, or vice versa?
    • Summary metrics for low prevalence
    • Metrics for shifts in prevalence
    • Multi-threshold metrics
    • Confidence scores and calibration
  • Multi-class classification
    • Adapting binary metrics to multi-class settings
    • Multilabel classification
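
As a taste of the planned code examples, the sketch below computes, with scikit-learn, the confusion matrix, sensitivity versus positive predictive value, a summary metric suited to low prevalence, a multi-threshold metric, and a simple calibration score on an artificial imbalanced problem. The dataset and the logistic-regression classifier are illustrative placeholders, not the tutorial material itself:

    # Illustrative sketch: binary-classification metrics under class imbalance
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (
        confusion_matrix, recall_score, precision_score,
        balanced_accuracy_score, roc_auc_score, brier_score_loss,
    )

    # An imbalanced binary problem: roughly 10% positives
    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    # The confusion matrix: the raw counts behind every summary metric
    print(confusion_matrix(y_test, y_pred))

    # P(detection | true event) versus P(true event | detection)
    print("sensitivity (recall):", recall_score(y_test, y_pred))
    print("PPV (precision):", precision_score(y_test, y_pred))

    # A summary that behaves better under class imbalance than accuracy
    print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))

    # A multi-threshold metric computed on the continuous score
    print("ROC AUC:", roc_auc_score(y_test, y_proba))

    # A first look at the calibration of the confidence scores
    print("Brier score:", brier_score_loss(y_test, y_proba))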

Metrics for regression (code sketch below)

  • R2 score
  • Absolute error measures
  • Assessing the distribution of errors
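
A corresponding sketch for regression metrics; the dataset and the gradient-boosting model are again illustrative placeholders:

    # Illustrative sketch: regression metrics and the distribution of errors
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_absolute_error, median_absolute_error

    X, y = make_regression(n_samples=2000, noise=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    reg = HistGradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    # R2: fraction of variance explained, relative to predicting the mean
    print("R2:", r2_score(y_test, y_pred))

    # Absolute error measures, expressed in the units of the target
    print("mean absolute error:", mean_absolute_error(y_test, y_pred))
    print("median absolute error:", median_absolute_error(y_test, y_pred))

    # Beyond single summaries: inspect the distribution of the errors
    errors = y_test - y_pred
    print("5th / 50th / 95th error percentiles:", np.percentile(errors, [5, 50, 95]))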

Evaluation strategies

Evaluating a learning procedure (code sketch below)

  • Cross-validation strategies
  • Driving model choices: nested cross-validation
  • Statistical testing
    • Sources of variance
    • Accounting for benchmarking variance
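
A sketch of these evaluation strategies: repeated cross-validation to gauge the variance of the performance estimate, and nested cross-validation so that hyper-parameter selection does not optimistically bias the reported score. The estimator, parameter grid, and data are illustrative placeholders:

    # Illustrative sketch: cross-validation variance and nested cross-validation
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import (
        GridSearchCV, RepeatedStratifiedKFold, cross_val_score,
    )

    X, y = make_classification(n_samples=1000, random_state=0)

    # Repeated cross-validation: report the spread of scores, not just the mean
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, scoring="roc_auc")
    print("CV score: %.3f +/- %.3f" % (scores.mean(), scores.std()))

    # Nested cross-validation: the hyper-parameter search runs inside each
    # training fold, so the outer scores are not optimistically biased
    inner_search = GridSearchCV(LogisticRegression(max_iter=1000),
                                param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
    nested_scores = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
    print("nested CV score: %.3f +/- %.3f"
          % (nested_scores.mean(), nested_scores.std()))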

Evaluating generalization to an external population (code sketch below)

  • The notion of external validity
  • Confidence intervals for external validation
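
A sketch of a simple non-parametric confidence interval for external validation, obtained by bootstrapping the external test set; the arrays y_external and y_score are illustrative placeholders for the labels and predicted scores on a population not used to build the model:

    # Illustrative sketch: bootstrap confidence interval on an external test set
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Placeholder external-validation data: true labels and predicted scores
    y_external = rng.integers(0, 2, size=500)
    y_score = 0.5 * y_external + 0.5 * rng.random(size=500)

    # Resample the test set with replacement and recompute the chosen metric
    boot_scores = []
    for _ in range(1000):
        idx = rng.integers(0, len(y_external), size=len(y_external))
        if len(np.unique(y_external[idx])) < 2:
            continue  # skip resamples that contain a single class
        boot_scores.append(roc_auc_score(y_external[idx], y_score[idx]))

    low, high = np.percentile(boot_scores, [2.5, 97.5])
    print("ROC AUC: %.3f (95%% CI: %.3f to %.3f)"
          % (roc_auc_score(y_external, y_score), low, high))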

Domains

Machine Learning, Medicine/Health, Statistics

Public link to supporting material

https://hal.archives-ouvertes.fr/hal-03682454

Abstract as a tweet

Evaluating machine-learning models beyond the basics: metrics suitable for low or varying prevalence, confidence intervals

Expected audience expertise: Domain

some

Expected audience expertise: Python

some

Gaël Varoquaux is a research director working on data science and health at Inria (the French national research institute for computer science). His research focuses on using data and machine learning for scientific inference, with applications to health and social science, as well as on developing tools that make it easier for non-specialists to use machine learning. He has been building easy-to-use open-source software in Python for more than 15 years. He is a core developer of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python, e.g. as a creator of the scipy lecture notes.
