PPML: Machine Learning on data you cannot see
2023-08-15, Aula

Privacy guarantees are the most crucial requirement when it comes to analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can be exploited to leak sensitive data when they are attacked and no counter-measure is in place. Privacy-preserving machine learning (PPML) methods hold the promise of overcoming all these issues, allowing machine learning models to be trained with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis, and see how these techniques can be used to safely train ML models without actually seeing the data.


Privacy guarantees are the most crucial requirement when it comes to analysing sensitive data. These requirements can sometimes be so stringent that they become a real barrier to the entire data analysis pipeline. The reasons for this are manifold: data often cannot be shared nor moved from the silos where they reside, let alone analysed in their raw form. As a result, data anonymisation techniques are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy is completely preserved. Moreover, the memorisation effect of deep learning models can be maliciously exploited to attack the models and reconstruct sensitive information about the samples used in training, even if that information was never explicitly provided.

Privacy-preserving machine learning (PPML) methods hold the promise of overcoming all these issues, allowing machine learning models to be trained with full privacy guarantees.

The workshop will be organised in two main parts. In the first part, we will focus on machine learning: we will demonstrate how DL models can be exploited (i.e. via inference attacks) to reconstruct the original data solely by analysing model predictions, and then explore how differential privacy can help us protect the privacy of our model, with minimum disruption to the original pipeline.
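To give a flavour of what "minimum disruption" can look like, here is a minimal sketch of differentially private training for a PyTorch model using Opacus. The library choice, the toy model, and the hyper-parameters are illustrative assumptions, not necessarily the stack used in the tutorial.

```python
# Minimal sketch: wrapping an existing PyTorch training setup with Opacus'
# PrivacyEngine so that training satisfies differential privacy, while the
# surrounding pipeline stays essentially unchanged. Toy data and model are
# placeholders for a real workload.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy dataset and model standing in for a real pipeline
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# Attach the privacy engine: per-sample gradients are clipped and noised
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,   # amount of Gaussian noise added to gradients
    max_grad_norm=1.0,      # per-sample gradient clipping threshold
)

# The training loop itself is unchanged
for epoch in range(3):
    for xb, yb in train_loader:
        if len(yb) == 0:    # Poisson sampling can yield empty batches
            continue
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The key point of the sketch is that only the optimizer/data-loader wrapping changes; the model definition and the training loop are the ones you already had.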

In the second part, we will consider more complex ML scenarios in which deep learning networks are trained on encrypted data, using specialised distributed federated learning strategies.
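As a rough illustration of the federated idea (independently of the specific framework used in the workshop, which is not assumed here), the following NumPy sketch shows federated averaging: each client updates a local copy of the model on its own data, and only model weights are shared with the server, never the raw data.

```python
# Minimal NumPy sketch of federated averaging (FedAvg) on a toy linear
# regression problem. Each client keeps its data in its own silo; the server
# only ever sees (and averages) model weights.
import numpy as np

rng = np.random.default_rng(0)

def local_sgd_step(weights, X, y, lr=0.1):
    """One gradient step of linear regression on a client's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Three clients, each holding data that never leaves its silo
clients = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(3)]
global_weights = np.zeros(5)

for federated_round in range(20):
    # Each client trains locally, starting from the current global model
    local_updates = [local_sgd_step(global_weights.copy(), X, y) for X, y in clients]
    # The server aggregates by averaging the returned weights
    global_weights = np.mean(local_updates, axis=0)

print("global model after federated training:", global_weights)
```

In the workshop this basic scheme is combined with encryption techniques so that even the exchanged model updates do not expose sensitive information.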


Category [High Performance Computing]

Parallel Computing

Expected audience expertise: Python

some

Category [Community, Education, and Outreach]

Learning and Teaching Scientific Python

Category [Scientific Applications]

Astronomy

Category [Data Science and Visualization]

Data Analysis and Data Engineering

Abstract as a tweet

Privacy Preserving Machine Learning: ML on Data you cannot see

Category [Machine and Deep Learning]

Algorithmic bias and Trustworthy AI

Expected audience expertise: Domain

some

Valerio Maggio is a researcher and a Data Scientist Advocate at Anaconda. He is well versed in open science and research software, supporting the adoption of best software development practices (e.g. code review) in data science. He has recently been awarded a fellowship from the Software Sustainability Institute (profile) focused on developing open teaching modules [1][2] on privacy-preserving machine learning technologies. Valerio is also an open-source contributor and an active member of the Python community. Over the last twelve years he has contributed to and volunteered in the organisation of many international conferences and community meetups, such as PyCon Italy, PyData, EuroPython, and EuroSciPy. All his talks, workshop materials, and open-source contributions are publicly available on his Speaker Deck and GitHub profiles.