2024-04-05, Room 203
Passing metadata such as sample_weight and groups through a scikit-learn cross_validate, GridSearchCV, or Pipeline to the right estimators, scorers, and CV splitters has historically been cumbersome, hacky, or outright impossible. The new metadata routing mechanism in scikit-learn lets you pass metadata through these objects. As a use case, we study how you can implement revenue-sensitive scoring while doing a hyperparameter search within a GridSearchCV object.
In this talk we go through a use case in which we implement a custom scorer that takes revenue into account while searching for the best hyperparameters with GridSearchCV. This requires the new metadata routing available in the latest scikit-learn release. The talk will give you insight into this new feature, how it is implemented, and how you can make use of it.
Historically, passing metadata such as sample_weight has not been consistent in scikit-learn, and with composite meta-estimators such as Pipeline one needs the step_name__metadata syntax, which is brittle and forces you to repeat the same metadata parameter for each step that needs it. With this syntax, it has also been impossible to pass a Pipeline object as a sub-estimator of many other meta-estimators such as RFE.
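As a reminder of the old behaviour, here is a minimal sketch of the step_name__metadata syntax (the pipeline, data, and weights below are illustrative, not from the talk): the fit parameter must be addressed by step name, and reaches only that one step.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X.sum(axis=1)
w = np.ones(20)

pipe = Pipeline([("scale", StandardScaler()), ("model", Lasso(alpha=0.01))])

# Old step_name__metadata syntax: sample_weight reaches only the "model"
# step; any other step needing it would require repeating the parameter.
pipe.fit(X, y, model__sample_weight=w)
```

If several steps consumed sample_weight, each would need its own prefixed argument, which is exactly the repetition the new routing mechanism removes.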
With the new metadata routing available in the latest scikit-learn, it is now easy to pass metadata around. For example, here we pass sample_weight and groups to the splitter and scorer objects, as well as to the estimator inside GridSearchCV.
import sklearn
from sklearn.linear_model import Lasso
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV, GroupKFold

# Metadata routing is opt-in in recent releases.
sklearn.set_config(enable_metadata_routing=True)

# The estimator requests sample_weight for fit.
estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {"alpha": [0.1, 0.5, 1.0, 2.0]}
# The scorer requests sample_weight for scoring.
scorer = get_scorer("neg_mean_squared_error").set_score_request(
    sample_weight=True
)
cv = GroupKFold(n_splits=5)  # group splitters request `groups` by default
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=hyperparameter_grid,
    cv=cv,
    scoring=scorer,
)
Adrin is a scikit-learn maintainer and works on a few other open source projects. He has a PhD in Bioinformatics and has worked as a consultant and on an algorithmic privacy and fairness team. He is now a cofounder at probabl.ai, where they work on enabling people to do statistically sane machine learning.