dirty_cat : a Python package for Machine Learning on Dirty Categorical Data EuroSciPy 2022


In this talk, we will introduce "dirty_cat", a Python library for encoding dirty, non-curated categorical features into numerical features while preserving similarities.
We will focus on the methods implemented in the similarity encoder, the Gamma-Poisson encoder, the min-hash encoder, and the super-vectorizer.

Machine learning models are used in many applications to predict a target variable from features that provide additional information. For instance, we may want to predict the salary of an employee (target) given their job title and experience level (features). However, most machine learning algorithms only operate on numerical features. When dealing with categorical features like job titles, a preprocessing step is needed to encode them into numerical features.

A naive way to encode such data into numerical features is one-hot encoding: each job title is assigned a binary vector that is orthogonal to those of all other job titles. This simple method encodes each category (job title) independently of the others, and therefore fails to capture similarities between them. For instance, "Police Captain" is more similar to "Police Cadet" than to "Property Manager II", so the first two are more likely to have similar salaries. Preserving this information when encoding categories is crucial to the performance of the machine learning model.
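
To make the orthogonality point concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper names (one_hot, dot) are hypothetical and written for this illustration only; they are not part of dirty_cat.

```python
def one_hot(categories):
    """Map each distinct category to a binary vector containing a single 1."""
    distinct = sorted(set(categories))
    return {
        cat: [1 if i == j else 0 for j in range(len(distinct))]
        for i, cat in enumerate(distinct)
    }


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


jobs = ["Police Captain", "Police Cadet", "Property Manager II"]
vectors = one_hot(jobs)

# Both dot products are zero: the encoding treats the two police jobs
# as being exactly as unrelated as a police job and a property manager.
captain_cadet = dot(vectors["Police Captain"], vectors["Police Cadet"])
captain_manager = dot(vectors["Police Captain"], vectors["Property Manager II"])
print(captain_cadet, captain_manager)
```

Every pair of distinct categories gets the same dot product, so a model cannot learn from the encoding alone that the two police titles are related.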

dirty_cat is a Python package that provides tools to encode categorical data into numerical features while capturing similarities between categories: after encoding, similar categories are described by vectors with close coefficients, which improves prediction. The package provides a user-friendly interface to transform a raw table with string features into a clean table that machine learning models can use directly.
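
The idea of "similar categories get close vectors" can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not dirty_cat's implementation: it assumes that string similarity is measured as the Jaccard overlap of character 3-grams, and that each category is encoded by its similarity to a small list of reference categories. The function names (ngrams, ngram_similarity, similarity_encode) are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (assumed metric)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to contain any n-gram.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, prototypes):
    """Encode each value as its vector of similarities to the prototypes."""
    return [[ngram_similarity(v, p) for p in prototypes] for v in values]


jobs = ["Police Captain", "Police Cadet", "Property Manager II"]
# Here the prototypes are simply the observed categories themselves.
encoded = similarity_encode(jobs, prototypes=jobs)

for job, row in zip(jobs, encoded):
    print(job, [round(x, 2) for x in row])
```

With this encoding, the row for "Police Cadet" has a larger coefficient in the "Police Captain" column than the row for "Property Manager II" does, so the two police titles end up closer to each other than to the unrelated title.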

We will focus on methods developed in "Similarity encoding for learning with dirty categorical variables" and "Encoding high-cardinality string categorical variables" by Patricio Cerda et al., and implemented in the Gamma-Poisson encoder, the similarity encoder, the min-hash encoder, and the super-vectorizer.

Public link to supporting material:

https://github.com/dirty-cat/dirty_cat/

Abstract as a tweet:

In this talk, we will introduce "dirty_cat", a Python library for encoding dirty, non-curated categorical features into numerical features while preserving similarities. We will focus on a few methods implemented in the similarity encoder,

Project Homepage / Git:

https://github.com/dirty-cat/dirty_cat/

Domains:

Machine Learning, Open Source Library

Expected audience expertise: Domain:

none

Expected audience expertise: Python:

none

Lilian Boulard

I'm currently (June 2022) a student in mathematics and computer science at Paris-Saclay University.
I'm doing my master's degree as an apprentice at the National Institute for Research in Computer Science and Automation (Inria), where I work under the supervision of Gaël Varoquaux, PhD, in the Soda team, on dirty and missing data in machine learning settings.
