An update on the latest scikit-learn features
2024-09-26, Gaston Berger

In this talk, we provide an update on the latest scikit-learn features that have been implemented in versions 1.4 and 1.5. We will particularly discuss the following features:

  • the metadata routing API, which allows metadata to be passed between estimators;
  • the TunedThresholdClassifierCV, which allows tuning the operational decision threshold with a custom metric;
  • better support for categorical features and missing values;
  • interoperability with array and dataframe libraries.

We present several notable additions available in scikit-learn since version 1.4.

First, we discuss the new metadata routing feature and its API. We start with a concrete example of metadata: sample_weight and groups. These parameters were already accepted by many estimators, metrics, and evaluation tools. However, in older scikit-learn versions, their handling came with shortcomings that we examine. Then, we show how the new metadata routing API addresses these shortcomings, using examples that involve nested estimators. Concretely, we show how to enable and use routing in a use case that combines cross-validation, a scikit-learn pipeline, and grid search. We then give another concrete example where metadata routing unlocks an important feature: optimizing a classifier's decision threshold for a dedicated business metric. This was not possible in the past; it is now possible in scikit-learn thanks to metadata routing combined with a new meta-estimator, TunedThresholdClassifierCV.

Then, we present the native handling of categorical features and missing values in some estimators (i.e., gradient boosting and random forest), which leverages pandas data types.

Finally, we present some advancements regarding interoperability with different input data. Notably, we show the early stages of adopting the so-called Array API (https://data-apis.org/array-api/). This work brings GPU processing to scikit-learn and improves performance for some algorithms. We showcase an example using a typical machine learning pipeline. In addition, we present the recent support for another dataframe library, polars.

Guillaume is an open-source software engineer working at :probabl. He is a core maintainer of the scikit-learn and imbalanced-learn libraries.

Stefanie is an open-source developer at :probabl. and a contributor to scikit-learn.