Skrub: machine learning for dataframes
2025-10-01, Louis Armand 1 - Est

Skrub is an open-source package that simplifies machine learning with dataframes by providing a variety of tools to explore, prepare and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub DataOps make it possible to build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment.
The talk showcases various use cases where skrub can simplify the job of a data scientist from data preparation to deployment, through code examples and demonstrations.


Machine-learning algorithms expect a numeric array with one row per observation. Typically, creating this table requires 'wrangling' with Pandas or Polars (aggregations, selections, joins, etc.), and extracting numeric features from structured data types such as datetimes. These transformations must be applied consistently when making predictions for unseen inputs, and choices must be informed by performance measured on a validation dataset while preventing data leakage. This preprocessing is often the most difficult and time-consuming part of many data science projects.
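
As a rough illustration of the consistency requirement described above, here is a minimal sketch using only Pandas and scikit-learn. The column names (`timestamp`, `amount`), the toy data, and the helper `datetime_features` are all hypothetical and chosen purely for illustration. The point is that wrapping the preprocessing in a pipeline means the same transformations are replayed on unseen rows, and the scaler statistics are learned from the training split only.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the (hypothetical) 'timestamp' column into numeric features."""
    ts = pd.to_datetime(df["timestamp"])
    return pd.DataFrame(
        {
            "weekday": ts.dt.dayofweek,
            "day": ts.dt.day,
            "month": ts.dt.month,
            "amount": df["amount"],
        },
        index=df.index,
    )


# Toy data, invented for illustration only.
data = pd.DataFrame(
    {
        "timestamp": pd.date_range("2025-01-01", periods=48, freq="D").astype(str),
        "amount": range(48),
    }
)
target = data["amount"] * 2.0 + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=0
)

# The pipeline stores every preprocessing step, so predict() and score()
# apply exactly the same transformations to rows not seen during fit().
model = make_pipeline(
    FunctionTransformer(datetime_features),
    StandardScaler(),  # statistics are computed on the training split only
    Ridge(alpha=1e-3),
)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on held-out rows
```

Even in this small case the feature extraction has to be written by hand and kept in sync with the model, which is the kind of effort the paragraph above refers to.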

Skrub bridges the gap between complex tabular data stored in Pandas or Polars dataframes, and machine-learning algorithms implemented by scikit-learn estimators. It provides transformers to extract features from datetimes, (fuzzy) categories, and text. Its pre-built, flexible learners offer very robust performance on many tabular datasets without requiring manual tweaking. It can create complex pipelines that handle multiple tables, while easily describing and searching rich hyperparameter spaces. As interactivity and visualization are essential for preprocessing, skrub also provides an interactive report to explore a dataframe. Additionally, its pipelines can be built incrementally while inspecting intermediate results.

The talk covers the main features of skrub through various common (15 min) and advanced (10 min) use cases, demonstrating how skrub can simplify these tasks compared to standard libraries, with code examples. The intended audience includes data scientists and researchers who typically need to combine dataframe libraries such as Polars and Pandas with scikit-learn pipelines.

Slides: https://skrub-data.org/skrub-materials/pages/slides/pydata-2025/slides.html
GitHub repo: https://github.com/skrub-data/skrub/
Website: https://skrub-data.org/stable/index.html
Additional material: https://skrub-data.org/skrub-materials/

I am a research engineer at Inria working on open-source Python packages for data science.

I am a research engineer at Inria, part of P16 and of the SODA research team. I am the lead developer of the skrub Python package and spend most of my time on it, but I am also interested in research on tabular learning and tabular foundation models.