PyCon DE & PyData 2025

Duplicate record detection using GenAI techniques to improve data quality
2025-04-23, Platinum3

Duplicate records can have a negative impact on many areas of a business. Current methods for detecting duplicate records use traditional NLP techniques known as “Entity Matching”. This traditional approach can be improved by incorporating GenAI techniques that do not entail any calls to OpenAI. Not only does this produce better matches, it also keeps the data safe, since no information is transferred externally.


  • A description of the problem of duplicate records and their impact on businesses
  • An overview of the proposed solution
  • How to use GenAI models and techniques to identify potential duplicate records
  • Step 1: identifying the columns to match on
  • Step 2: creating embedding vectors for these columns
  • Step 3: creating match clusters
  • Step 4: presenting those clusters to the users, who can then choose what to do with the duplicates
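
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from the talk or its repository: the sample records, the column names, the `embed` and `cluster` helpers, and the similarity threshold are all hypothetical. In particular, `embed` is a stand-in that hashes character trigrams so the example runs with the standard library alone; in practice it would be replaced by a locally hosted embedding model so that no data leaves the machine.

```python
import hashlib
import math
from collections import defaultdict

# Hypothetical sample records; column names are illustrative only.
records = [
    {"name": "Acme Corporation", "city": "Berlin"},
    {"name": "ACME Corporation", "city": "berlin"},
    {"name": "Globex GmbH", "city": "Munich"},
    {"name": "Initech Ltd", "city": "Hamburg"},
]

# Step 1: choose the columns to match on (assumed for this sketch).
MATCH_COLUMNS = ["name", "city"]


def embed(text, dim=256):
    """Stand-in embedding (NOT a GenAI model): hashed character trigrams,
    L2-normalised. Swap in a locally run embedding model in practice."""
    padded = f"  {text.lower()}  "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def cluster(vectors, threshold):
    """Group indices whose pairwise similarity reaches the threshold,
    using a simple union-find. Returns only clusters with 2+ members."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(vectors)):
        groups[find(i)].append(i)
    return [members for members in groups.values() if len(members) > 1]


# Step 2: build one embedding per record from the chosen columns.
vectors = [embed(" ".join(r[c] for c in MATCH_COLUMNS)) for r in records]

# Step 3: form match clusters (the threshold is an arbitrary assumption).
clusters = cluster(vectors, threshold=0.9)

# Step 4: present each cluster so a user can decide what to do with it.
for number, members in enumerate(clusters, start=1):
    print(f"Possible duplicates, cluster {number}:")
    for index in members:
        print("  ", records[index])
```

The all-pairs comparison in `cluster` is quadratic in the number of records, which is fine for a sketch but would likely need an approximate nearest-neighbour index on larger tables. The threshold controls the trade-off between missed duplicates and false matches and would need tuning on real data.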

Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Intermediate

Public link to supporting material, e.g. videos, Github, etc.:

https://github.com/ianormy/genai_duplicate_detection_paper

Ian Ormesher is a seasoned full-stack Data Scientist with a robust background in training and deploying AI models in production environments. With a career spanning over four decades, he has honed his skills in Machine Learning, Deep Neural Networks, Reinforcement Learning, and Computer Vision. He is proficient in a wide array of programming languages and data analysis tools, with a proven track record of implementing data-oriented solutions in the Cloud.