2022-08-29 –, HS 120
This tutorial shows how to use scikit-learn's histogram-based gradient boosted
regression trees with various loss functions (least squares, Poisson, and the
pinball loss for quantile estimation) on a time series forecasting problem. We
will see how to use pandas to build lag and windowing features, and how to use
scikit-learn's time series cross-validation and other model evaluation tools.
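As a rough illustration of lag and windowing features, here is a minimal sketch with pandas. The hourly `demand` series is synthetic and the column names are hypothetical; they are assumptions for the example, not the tutorial's actual dataset.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for real data (assumption).
rng = np.random.default_rng(0)
index = pd.date_range("2022-01-01", periods=200, freq="60min")
values = np.sin(np.arange(200) * 2 * np.pi / 24) + rng.normal(scale=0.1, size=200)
demand = pd.Series(values, index=index, name="demand")

# Lag features look at single past values; windowing features aggregate
# a span of past values. Shifting by one step before rolling keeps the
# current target value out of its own features.
features = pd.DataFrame(
    {
        "lag_1h": demand.shift(1),
        "lag_24h": demand.shift(24),
        "rolling_mean_24h": demand.shift(1).rolling(24).mean(),
        "rolling_max_24h": demand.shift(1).rolling(24).max(),
    }
)

# Rows without a full history contain NaN and are dropped.
dataset = pd.concat([features, demand], axis=1).dropna()
X, y = dataset.drop(columns="demand"), dataset["demand"]
```

Each row of `X` now only contains information available strictly before the timestamp of the corresponding target in `y`, which is what turns the forecasting task into an ordinary regression problem.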
This tutorial is intended for an audience with some familiarity with data
science tools and machine learning concepts. It will start from practical
considerations on how to manipulate the data and fit simple yet powerful
models and progressively move to more advanced considerations on model
evaluation.
The main focus is to show how to cast a time series forecasting problem into a
supervised machine learning problem (non-linear regression) using basic
pandas-based feature engineering and time series aware cross-validation, and to
highlight the impact of the choice of loss function for gradient boosting
models.
We will compare this forecasting approach to a baseline that only uses
instantaneous contextual variables as predictors, built with scikit-learn's
feature engineering tools (column transformers and pipelines), with a particular
emphasis on how to build cyclic time-derived features and categorical variables.
We will then dive deeper into model evaluation, assessing various performance metrics with time-series aware cross-validation.
We will compare uncertainty bounds from quantile regression with conformal prediction methods from MAPIE.
Finally, we will explore how to deal with the auto-regressive setting to forecast over a multi-step horizon with sktime.
The tutorial will be available as a Jupyter notebook and the audience will be
encouraged to develop their own intuitions by experimenting interactively with
the teaching material.
If time allows, we will also compare this approach with alternative solutions
based on neural networks or linear models trained on rich spline-based features.
Tutorial by @ogrisel on how to build and evaluate time-series forecasting models using scikit-learn histogram gradient boosting regression trees, the pinball loss and pandas-based lag and windowing feature engineering.
Project Homepage / Git –
Domains – Machine Learning, Open Source Library, Statistics
Expected audience expertise: Domain – some
Expected audience expertise: Python – some
Public link to supporting material –
Olivier Grisel is a software engineer at Inria and a maintainer of the scikit-learn machine learning library.