PyCon Lithuania 2024

Revenue based scoring in `GridSearchCV`: a case for the new metadata routing in scikit-learn
2024-04-05, Room 203

Passing metadata such as sample_weight and groups through scikit-learn's cross_validate, GridSearchCV, or Pipeline to the right estimators, scorers, and CV splitters has historically been cumbersome, hacky, or outright impossible.

The new metadata routing mechanism in scikit-learn enables you to pass metadata through these objects. As a use-case, we study how you can implement revenue-sensitive scoring while doing a hyperparameter search within a GridSearchCV object.


In this talk we go through a use-case where we implement a custom scorer that takes revenue into account while searching for the best hyperparameters in a GridSearchCV. This requires the new metadata routing available in the latest scikit-learn release.

This talk will give you an insight into this new feature, how it’s implemented, and how you can make use of it.

Historically, passing metadata such as sample_weight has not been consistent in scikit-learn, and with composite meta-estimators such as Pipeline, one needs to use the step_name__metadata syntax, which is brittle and requires repeating the same metadata parameter for each step. With this syntax, it has also been impossible to pass a Pipeline object as a sub-estimator to many other meta-estimators such as RFE.
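To illustrate the legacy pattern (the pipeline, data, and step names below are illustrative assumptions), the same sample_weight has to be repeated once per consuming step, addressed by step name:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=30)
sample_weight = rng.uniform(0.5, 2.0, size=30)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Lasso(alpha=0.1))])

# Legacy step_name__metadata syntax: the same metadata must be
# repeated for every step that consumes it.
pipe.fit(
    X, y,
    scaler__sample_weight=sample_weight,
    model__sample_weight=sample_weight,
)
```

Renaming a step silently breaks these keyword arguments, which is part of what makes the syntax brittle.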

With the new metadata routing available in the latest scikit-learn, it is now easy to pass metadata around. For example, here we pass sample_weight and groups to the splitter and the scorer objects, as well as the estimator in GridSearchCV.

import sklearn
from sklearn.linear_model import Lasso
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV, GroupKFold

# Metadata routing is opt-in and must be enabled explicitly.
sklearn.set_config(enable_metadata_routing=True)

estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {"alpha": [0.1, 0.5, 1.0, 2.0]}
scorer = get_scorer("neg_mean_squared_error").set_score_request(
    sample_weight=True
)
cv = GroupKFold(n_splits=5)

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=hyperparameter_grid,
    cv=cv,
    scoring=scorer,
)

Adrin is a scikit-learn maintainer and works on a few other open source projects. He has a PhD in Bioinformatics, has worked as a consultant and in an algorithmic privacy and fairness team. He is now a co-founder at probabl.ai, where they work on enabling people to do statistically sane machine learning.