Justine BEL-LETOILE PyData Paris 2025

Justine BEL-LETOILE

Justine is a data scientist at Hellowork, the French leader in talent acquisition, job search and course search tech. She has spent the last 10+ years enjoying machine learning, Python and other fun data science stuff in various fields. Her current work includes a good deal of natural language processing.

Session

10-01

10:05

30min

Balancing Privacy and Utility: Efficient PII Detection and Replacement in Textual Data

Justine BEL-LETOILE, Elizaveta Clouet

Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data (such as invoices, resumes, or medical records) poses unique challenges. Personally identifiable information (PII) can appear anywhere and in unpredictable formats, which raises the question: how can it be modified while preserving the dataset's usefulness?

Let's explore a practical, open-source two-step approach to text anonymization: (1) detecting PII using named entity recognition (NER) models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline leveraging tools such as pre-trained PII detection models, GLiNER for fine-tuning, and Faker for generating meaningful replacements.
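The detect-then-replace idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speakers' pipeline: regular expressions stand in for the NER model (a pre-trained or GLiNER-based detector would take the place of `detect_pii`), and a hash-based surrogate stands in for Faker. The function names, labels, and patterns are all hypothetical.

```python
import hashlib
import itertools
import re

# Assumption: regex detectors stand in for an NER model in this sketch.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def detect_pii(text):
    """Step 1: return sorted, non-overlapping (start, end, label) spans."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], -1
    for start, end, label in spans:
        if start >= last_end:  # skip spans overlapping an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def surrogate(original, label):
    """Step 2 helper: deterministic replacement that keeps the format.

    Assumption: a SHA-256 digest stands in for a Faker-style generator.
    """
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    if label == "EMAIL":
        return f"user{digest[:8]}@example.com"
    # PHONE: swap every digit, keep spaces, dots, dashes and '+' in place.
    digits = itertools.cycle(str(int(c, 16) % 10) for c in digest)
    return "".join(next(digits) if ch.isdigit() else ch for ch in original)


def anonymize(text):
    """Replace each detected span, leaving the rest of the text untouched."""
    pieces, cursor = [], 0
    for start, end, label in detect_pii(text):
        pieces.append(text[cursor:start])
        pieces.append(surrogate(text[start:end], label))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    print(anonymize("Contact jane.doe@corp.fr or 06 12 34 56 78."))
```

Because the surrogate is derived from the original value, the same PII string always maps to the same replacement, which keeps repeated mentions consistent across a document. A plain hash is not a security guarantee, so a real pipeline would likely add a secret key or a random mapping table.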

Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.

Gaston Berger
