2024-09-26 –, Gaston Berger
In this talk, we provide an update on the latest scikit-learn features that have been implemented in versions 1.4 and 1.5. We will particularly discuss the following features:
- the metadata routing API, which allows metadata to be passed around estimators;
- the TunedThresholdClassifierCV, which allows tuning operational decisions through a custom metric;
- better support for categorical features and missing values;
- interoperability with array and dataframe libraries.
In this talk, we present several important new features available in scikit-learn since version 1.4.
First, we discuss the new metadata routing feature and its API. We start with a concrete example of metadata: sample_weight and groups. These parameters were already accepted by many estimators, metrics, and evaluation tools. However, in older scikit-learn versions, this support came with shortcomings that we examine. Then, we show how the new metadata routing API addresses these challenges with examples that involve nesting. Concretely, we show how to enable and use routing in a use case involving cross-validation, a scikit-learn pipeline, and a grid search. We then give another concrete example where metadata routing unlocks an important feature: optimizing a classifier's decision threshold with a dedicated business metric. This was not possible in the past; it is now possible in scikit-learn thanks to metadata routing coupled with a new meta-estimator, TunedThresholdClassifierCV, which optimizes the classification decision threshold.
Then, we present native handling of categorical and missing data in some estimators (i.e., gradient boosting and random forest) that leverage pandas data types.
Finally, we present some advancements regarding interoperability with different input data types. Notably, we show the first steps toward adopting the so-called Array API (https://data-apis.org/array-api/). This work brings scikit-learn to GPU processing and allows for improved performance for some algorithms. We showcase an example using a typical machine learning pipeline. In addition, we present the recent support for another dataframe library, polars.
Guillaume is an open-source software engineer working at :probabl. He is a core maintainer of the scikit-learn and imbalanced-learn libraries.
Stefanie is an open-source developer at :probabl. and a contributor to scikit-learn.