PyConDE & PyData Berlin 2024

Your Model _Probably_ Memorized the Training Data
2024-04-22, B07-B08

I know you probably don't want to hear about it, but your deep learning model probably memorized some of its training data. In this talk, we'll review active research on deep learning and memorization, particularly for large models such as large language and multi-modal models.

We'll also explore when this memorization is actually desired (and why), as well as the threat vectors and legal risks of using models that have memorized training data. Finally, we'll look at privacy protections that could address some of these issues, and how to embrace memorization by thinking through different types of models and their uses.


In this talk, I will cover:

  • Mathematical research demonstrating why deep learning models memorize information
  • A series of successful attacks against deep learning models, including GPT-style models, to extract memorized information
  • The legal and social impact of memorization and using memorized data
  • Differential privacy as one potential solution (but also its pitfalls when used to train large models)
  • Federated and/or local- or community-trained models as an alternative
  • The need for distillation that also attempts to reduce memorization

Expected audience expertise: Domain

Intermediate

Expected audience expertise: Python

None

Abstract as a tweet (X) or toot (Mastodon)

So, just how much data did ChatGPT memorize? Let's find out!

Public link to supporting material, e.g. videos, Github, etc.

https://probablyprivate.com

Katharine Jarmul is a privacy activist and data scientist whose work and research focus on privacy and security in data science workflows. She is a Principal Data Scientist at Thoughtworks and the author of Practical Data Privacy. She is a passionate and internationally recognized data scientist, programmer, and lecturer.