2025-10-01 – Gaston Berger
Most common machine learning models (linear, tree-based, or neural network-based) are trained to minimize the least squares loss for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].
In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models describe the full conditional distribution of y|X and allow us to quantify the predictive uncertainty of individual predictions.
On the model training side, we will introduce the following options:
- an ensemble of quantile regressors trained for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
- how to reduce probabilistic regression to multi-class classification, followed by a cumulative sum of the predict_proba output to recover a continuous conditional CDF;
- how to implement this reduction as a generic scikit-learn meta-estimator;
- how this approach is used to pretrain tabular foundation models (e.g. TabPFNv2);
- simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
- more specialized approaches such as the ones implemented in XGBoostLSS.
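As a taste of the first option, here is a minimal sketch (not taken from the talk materials) of an ensemble of gradient boosted quantile regressors, one per quantile level, fit on synthetic heteroscedastic data:

```python
# Sketch: one GradientBoostingRegressor per quantile level.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
# Noise level grows with X: a single point estimate of E[y|X]
# cannot capture this varying spread.
y = np.sin(X.ravel()) + rng.normal(scale=0.1 + 0.05 * X.ravel())

# One model per quantile level; together they sketch the
# conditional distribution of y|X.
quantiles = [0.05, 0.5, 0.95]
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in quantiles
}
preds = {q: m.predict(X) for q, m in models.items()}
```

The 5% and 95% predictions then bound a roughly 90% prediction interval whose width adapts to the local noise level.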
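The classification reduction mentioned above can be sketched as follows (an illustrative example, not an existing scikit-learn API): bin the target into ordered classes, fit a classifier, then cumulatively sum predict_proba to estimate the conditional CDF.

```python
# Sketch: probabilistic regression via multi-class classification.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + 0.5 * rng.normal(size=2000)

# Discretize y into ordered bins: these become the "classes".
n_bins = 20
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
y_binned = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_bins - 1)

clf = HistGradientBoostingClassifier(max_iter=30).fit(X, y_binned)

# Cumulative sum over the (ordered) class probabilities approximates
# P(y <= edge | X), i.e. the conditional CDF on the bin edges.
proba = clf.predict_proba(X[:5])
cdf = np.cumsum(proba, axis=1)
```

Interpolating between bin edges then yields a continuous CDF estimate; wrapping these steps in a fit/predict class is what the generic meta-estimator mentioned above would do.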
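For the Bayesian option, a small sketch on synthetic data: scikit-learn's BayesianRidge can return a predictive standard deviation alongside the mean via `predict(..., return_std=True)`.

```python
# Sketch: per-prediction uncertainty from a Bayesian linear model.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=500)

model = BayesianRidge().fit(X, y)
# mean is the point prediction; std quantifies the predictive
# uncertainty (noise plus parameter uncertainty) per data point.
mean, std = model.predict(X, return_std=True)
```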
We will also discuss how to evaluate probabilistic predictions via:
- the pinball loss of quantile regressors,
- other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
- coverage measures and width of prediction intervals,
- reliability diagrams for different quantile levels.
We will illustrate these concepts with concrete examples and running code.
Finally, we will illustrate why some applications need such calibrated probabilistic predictions:
- estimating uncertainty in trip times under varying traffic conditions to help a human decision maker choose among travel plan options,
- modeling value at risk for investment decisions,
- assessing the impact of missing variables for an ML model trained to work in degraded mode,
- Bayesian optimization for operational parameters of industrial machines from little/costly observations.
If time allows, we will also discuss the usage and limitations of Conformal Quantile Regressors as implemented in MAPIE, and contrast the aleatoric vs. epistemic uncertainty captured by those models.
The material (slides and notebook) for the presentation will be linked here on the day of the presentation.
Olivier is an open source fellow at probabl and a scikit-learn core contributor.