2023-08-17, Aula
Pickle files can be evil: simply loading them can run arbitrary code on your system. This talk presents why that is, how it can be exploited, and how skops is tackling the issue for scikit-learn/statistical ML models. We go through some lower-level pickle-related machinery and look in detail at how the new format works.
The pickle format has many vulnerabilities, and merely loading a pickle file can run arbitrary code on the user’s system [1]. In this session we go through the process the pickle module uses to persist Python objects, while demonstrating how it can be exploited. We cover how __getstate__ and __setstate__ are used, how the output of a __reduce__ method is used to reconstruct an object, and how a malicious implementation of these methods lets an attacker create a malicious pickle file without having to craft one by hand at the byte level. We also briefly touch on other known exploits and issues related to the format [2].
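To make this concrete, here is a minimal sketch (my own illustration, not an example taken from the talk) of how a malicious __reduce__ implementation turns loading into code execution; the class name and the echoed command are purely illustrative:

```python
import os
import pickle

class Exploit:
    # pickle calls __reduce__ to learn how to rebuild the object: it must
    # return a callable and the arguments to call it with. Nothing stops us
    # from returning an arbitrary callable such as os.system.
    def __reduce__(self):
        return os.system, ("echo 'code executed while unpickling'",)

payload = pickle.dumps(Exploit())

# The victim only has to load the bytes; the callable runs during loading,
# before any object is even handed back to them.
pickle.loads(payload)
```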
We also show how to look inside a pickle file and inspect the operations it runs while loading, and how to obtain an equivalent Python script that produces the same result as loading the pickle file [3].
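For instance, the standard library's pickletools module can disassemble a pickle into the opcode stream that the pickle virtual machine executes on load; this small example (again my own, not necessarily the one used in the talk) shows the idea:

```python
import pickle
import pickletools

payload = pickle.dumps({"weights": [0.1, 0.2], "name": "model"})

# pickletools.dis prints every opcode the pickle VM will run while loading,
# which is where operations like STACK_GLOBAL and REDUCE (import something,
# then call it) become visible.
pickletools.dis(payload)
```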
Then I present an alternative format from the skops library [4] which can be used to store scikit-learn based models. We talk about what the format is, how persistence and loading are done, and what we do to prevent loading malicious objects and avoid running arbitrary code. This format can be used to store almost any scikit-learn estimator, as well as xgboost, lightgbm, and catboost models.
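As a rough sketch of what using the format looks like (assuming the skops.io functions dump, load, and get_untrusted_types; the exact semantics of the trusted argument may differ between skops versions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skops.io import dump, get_untrusted_types, load

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist to the skops format instead of pickling.
dump(model, "model.skops")

# Inspect which types in the file are not already trusted by skops;
# for a plain scikit-learn model this is typically empty.
unknown = get_untrusted_types(file="model.skops")
print(unknown)

# Only explicitly trusted types are reconstructed, so the file cannot trigger
# arbitrary code execution the way pickle opcodes can. Review `unknown`
# before passing it along as trusted.
loaded = load("model.skops", trusted=unknown)
```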
Let’s exploit pickle, and skops to the rescue! Why pickle is dangerous and how to mitigate some of the issues.
Other
Expected audience expertise: Domain – some
Expected audience expertise: Python – some
Adrin works on a few projects, including skops, which tackles some of the MLOps challenges related to scikit-learn models. He has a PhD in Bioinformatics, has worked as a consultant, and has been part of an algorithmic privacy and fairness team. He's also a core developer of scikit-learn and fairlearn.