Let’s exploit pickle, and `skops` to the rescue!
2023-08-17, Aula

Pickle files can be evil: simply loading one can run arbitrary code on your system. This talk explains why that is, how it can be exploited, and how skops tackles the issue for scikit-learn/statistical ML models. We go through some of the lower-level pickle machinery and explain in detail how the new format works.


The pickle format has many vulnerabilities, and merely loading a pickle file can run arbitrary code on the user’s system [1]. In this session we go through the process the pickle module uses to persist Python objects and demonstrate how it can be exploited. We look at how `__getstate__` and `__setstate__` are used, how the output of a `__reduce__` method is used to reconstruct an object, and how a malicious implementation of these methods produces a malicious pickle file without any need to craft the file by hand at a lower level. We also briefly touch on other known exploits and issues related to the format [2].
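
As a minimal sketch of the `__reduce__` mechanism described above, the snippet below shows that unpickling calls whatever callable the pickle names. The class name `Payload` and the arithmetic expression are hypothetical and harmless stand-ins for an attacker's payload.

```python
import pickle


class Payload:
    """Hypothetical class; only its __reduce__ matters here."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments"
        # and replays that call when the data is loaded.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading runs eval("6 * 7"); no Payload instance is ever rebuilt.
result = pickle.loads(data)
print(type(result).__name__, result)
```

An attacker would replace `eval` and its argument with something like `os.system` and a shell command. Nothing in `pickle.loads` distinguishes the two cases.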

We also show how to look inside a pickle file and inspect the operations it runs while loading, and how to obtain an equivalent Python script that produces the same output as the pickle file [3].

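
One way to look inside a pickle without executing it is the standard library's `pickletools` module. This is an assumption about tooling, not necessarily what the talk uses. The sketch below lists the opcodes of a benign and a suspicious pickle; the `Payload` class and the `RISKY` opcode set are illustrative choices, not an exhaustive audit.

```python
import io
import pickle
import pickletools


class Payload:
    """Hypothetical class whose pickle calls eval on load."""

    def __reduce__(self):
        return (eval, ("6 * 7",))


# Opcodes that import a global or call something (illustrative set).
RISKY = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
         "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def risky_opcodes(data):
    # genops parses the byte stream without running any opcode.
    return [op.name for op, arg, pos in pickletools.genops(data)
            if op.name in RISKY]


benign = pickle.dumps({"weights": [1, 2, 3]})
suspicious = pickle.dumps(Payload())

benign_ops = risky_opcodes(benign)
suspicious_ops = risky_opcodes(suspicious)

# Human-readable disassembly, written to a buffer instead of stdout.
listing = io.StringIO()
pickletools.dis(suspicious, out=listing)
print(listing.getvalue())
```

A pickle containing only plain containers and numbers needs none of these opcodes, so any hit is a reason to read the disassembly before loading the file.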
Then I present an alternative format from the skops library [4], which can be used to store scikit-learn based models. We talk about what the format is, how persistence and loading are done, and what we do to prevent loading malicious objects and to avoid running arbitrary code. This format can be used to store almost any scikit-learn estimator, as well as xgboost, lightgbm, and catboost models.
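
The general idea of checking declared types before constructing anything can be illustrated with a toy format. This is not the skops format or its API: `LinearModel`, `TRUSTED_TYPES`, `dump_toy`, and `load_toy` are all hypothetical names invented for this sketch, which uses only the standard library.

```python
import json


class LinearModel:
    """Hypothetical stand-in for an estimator; not a scikit-learn class."""

    def __init__(self, coef, intercept):
        self.coef = coef
        self.intercept = intercept


# Hypothetical allowlist of types the loader is willing to construct.
TRUSTED_TYPES = {"LinearModel": LinearModel}


def dump_toy(obj):
    # Store only a type name and plain data: no callables, no opcodes.
    return json.dumps({"type": type(obj).__name__, "state": vars(obj)})


def load_toy(payload, trusted=TRUSTED_TYPES):
    record = json.loads(payload)
    type_name = record["type"]
    # Check the declared type before constructing anything.
    if type_name not in trusted:
        raise TypeError(f"untrusted type in file: {type_name!r}")
    cls = trusted[type_name]
    obj = cls.__new__(cls)  # skip __init__, restore attributes directly
    obj.__dict__.update(record["state"])
    return obj


model = load_toy(dump_toy(LinearModel([0.5, -1.2], 0.1)))

# A file that names a type outside the allowlist is refused.
tampered = json.dumps({"type": "os.system", "state": {}})
try:
    load_toy(tampered)
    rejected = False
except TypeError:
    rejected = True
```

The design choice is that the file can only name types, never supply code, and the loader decides which names it is willing to act on.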


Abstract as a tweet:

Let’s exploit pickle, and skops to the rescue! Why pickle is dangerous and how to mitigate some of the issues.

Category [Machine and Deep Learning]:

Other

Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Project Homepage / Git:

https://skops.readthedocs.io/en/stable/

Adrin works on a few projects, including skops, which tackles some of the MLOps challenges related to scikit-learn models. He has a PhD in Bioinformatics, has worked as a consultant, and has worked in an algorithmic privacy and fairness team. He's also a core developer of scikit-learn and fairlearn.