2025-10-01 – Gaston Berger
Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data, like invoices, resumes, or medical records, poses unique challenges: personally identifiable information (PII) can appear anywhere, in unpredictable formats, and modifying it while preserving the dataset's usefulness is far from trivial.
Let's explore a practical, open-source two-step approach to text anonymization: (1) detecting PII using NER models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline leveraging tools such as pre-trained PII detection models, gliner for fine-tuning, and Faker for generating meaningful replacements.
Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.
Textual data such as medical prescriptions, invoices, or resumes often contains personally identifiable information (PII) — names, addresses, emails, dates of birth, etc. Protecting user privacy, complying with GDPR, and ensuring robust machine learning pipelines all require effective handling of this PII. But while traditional anonymization tools work well for tabular data, they fall short with free text due to its unstructured nature.
Why? Because PII can be anywhere, take various formats, and replacing sensitive information without losing the text's integrity is tricky.
This talk introduces a two-step process to address this challenge:
1. Detect PII in free text leveraging named entity recognition (NER) models: we'll discuss pre-trained models, fine-tuned PII detection solutions, and libraries like gliner to adapt models to our specific context.
2. Generate replacements for detected PII: beyond simple substitution, we'll delve into creating meaningful replacements that preserve the original text's format and characteristics. This ensures that subsequent analyses remain valid. We'll discuss challenges like maintaining gender balance in names, urban/rural geographic distributions, or controlled format variety in fields like dates and emails.
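The detection step described above could be sketched with GLiNER's zero-shot API. The checkpoint name, label set, and the `redact` helper below are illustrative assumptions, not the speakers' actual pipeline; a hand-written fallback keeps the sketch runnable without downloading the model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Entities are dicts with 'start', 'end', and 'label' keys, the shape
    returned by GLiNER's predict_entities.
    """
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        out.append(text[cursor:ent["start"]])
        out.append(f"[{ent['label'].upper()}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)


text = "Invoice for John Smith, contact: john.smith@example.com."
labels = ["person", "email"]  # zero-shot: labels are plain strings

try:
    # Zero-shot detection with GLiNER (model download required).
    from gliner import GLiNER  # pip install gliner

    model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")
    entities = model.predict_entities(text, labels, threshold=0.5)
except ImportError:
    # Offline fallback: hand-written spans standing in for model output.
    entities = [
        {"start": 12, "end": 22, "label": "person"},
        {"start": 33, "end": 55, "label": "email"},
    ]

print(redact(text, entities))
```

Placeholders like `[PERSON]` are the simplest substitution; step 2 is about swapping them for realistic surrogates instead.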
Throughout the talk, we will demonstrate the power of the open-source machine learning community and the Python ecosystem in building a robust pseudonymization pipeline. Concrete code examples using the Faker library will illustrate how to tailor solutions to specific needs, ensuring both privacy and data usefulness.
This talk is aimed at engineers, data scientists, and NLP practitioners with a basic understanding of text processing but is accessible to a broader technical audience. Attendees will gain insights into constructing their own textual data anonymization or fake data generation pipelines.
Justine is a data scientist at Hellowork, the French leader in talent acquisition, job search and course search tech. She spent the last 10+ years enjoying machine learning, Python and other data science fun stuff in various fields. Her current work includes a good deal of natural language processing.
Data scientist with a strong interest in NLP techniques, Elizaveta is currently working at Hellowork on projects including document analysis, named entity recognition, and recommendation systems.