EuroSciPy 2026

How to use skrub Data Ops in practice
2026-07-21, Room 1.38 (Ground Floor, Turing)

Skrub is a package that makes it easier to prepare dataframes for machine-learning tasks. In practice, data can be spread over multiple tables, represent various types of information (tabular, textual, graphical), or be stored in external database systems rather than in dataframes.

Skrub Data Ops help build versatile pipelines that handle this variety of scenarios while avoiding data leakage. They also make it possible to define rich hyper-parameter grids that can be explored to maximize the performance of the final machine-learning model.

In this talk, we give a brief introduction to the Data Ops framework before presenting three separate use cases that highlight its versatility: a traditional machine-learning pipeline that uses Optuna to perform hyper-parameter tuning, a pipeline that trains on data stored in a relational database rather than a dataframe, and an image classification task with PyTorch.

By the end of the talk, attendees will know what the skrub Data Ops are, what their main features are, and how they can be used in different practical scenarios.


Building a machine learning pipeline is rarely a straightforward effort: data can be spread across multiple sources and storage formats; data preparation can involve multiple complex steps, unclear choices and assembling data coming from different sources; all operations must be executed while avoiding data leakage; there may be hyper-parameters to tune; and at the end of the process, it should be possible to re-execute all the same operations with the same parameters on unseen data.

Skrub Data Ops are a pipeline-building framework that addresses these difficulties: Data Ops wrap around any arbitrary function provided by the user, including non-standard data fetching and preparation steps; they simplify combining tables by letting users adopt the dataframe library of their choosing; they keep track of samples throughout pipeline construction and training to avoid data leakage; and they simplify the construction of rich hyper-parameter search spaces thanks to a set of "choose from" functions that let users set arbitrary operations as choices. Finally, Data Ops build a directed acyclic graph that tracks all the operations and estimators fitted up to a given point: this makes it possible to retain the state of fitted estimators and to re-execute all the steps in the same way on unseen data.
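As a rough illustration of the record-and-replay idea described above, the following toy sketch records a small graph of operations, fits a stateful step once, and then re-executes the same steps on unseen data. It uses only the Python standard library: the names `Node`, `var`, `apply_func`, `apply_estimator`, `run` and `MeanCenterer` are hypothetical and are not skrub's API, and the real Data Ops are considerably richer.

```python
# Toy illustration only: every class and function here is hypothetical
# and is NOT part of skrub. It mimics the idea of recording operations in
# a graph so they can be replayed with the same fitted state.


class MeanCenterer:
    """Minimal stateful step: learns a mean when fitted, reuses it later."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class Node:
    """One recorded step: an input placeholder, a function, or an estimator."""

    def __init__(self, kind, payload, parent=None):
        self.kind = kind        # "input", "func" or "estimator"
        self.payload = payload  # input name, callable, or stateful step
        self.parent = parent

    def apply_func(self, func):
        # Record a stateless, arbitrary user function.
        return Node("func", func, self)

    def apply_estimator(self, step):
        # Record a stateful step whose fitted state lives in the graph.
        return Node("estimator", step, self)

    def run(self, inputs, fit):
        # Walk back to the input, then execute each recorded step in order.
        if self.kind == "input":
            return inputs[self.payload]
        value = self.parent.run(inputs, fit)
        if self.kind == "func":
            return self.payload(value)
        if fit:
            self.payload.fit(value)  # state is learned on training data only
        return self.payload.transform(value)


def var(name):
    """Create a named placeholder that is bound to data at run time."""
    return Node("input", name)


# Build the graph once: drop empty strings, parse floats, then center.
raw = var("raw")
cleaned = raw.apply_func(lambda rows: [float(r) for r in rows if r != ""])
centered = cleaned.apply_estimator(MeanCenterer())

# Fit on training data, then replay the same steps on unseen data.
train_out = centered.run({"raw": ["1", "2", "", "3"]}, fit=True)
test_out = centered.run({"raw": ["10", ""]}, fit=False)

print(train_out)  # [-1.0, 0.0, 1.0]
print(test_out)   # [8.0]
```

The second call reuses the mean learned on the training data (2.0) instead of recomputing it from the new rows, which is the behaviour that keeps information from unseen data out of the fitted state.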

Throughout the presentation, we will show how these features can be employed in practical scenarios. We start from a traditional machine-learning pipeline built using the skrub Data Ops and employ Optuna as the backend for hyper-parameter search. We then move on to a multi-table scenario where tables are stored in a relational database: because Data Ops support arbitrary user code, it is not necessary to convert data to a dataframe format until training. We conclude the talk by presenting an example of image classification with PyTorch and skorch: Data Ops are not limited to tabular data and can handle other typical machine-learning tasks, while simplifying the code necessary to generate and test different model architectures.

All material for the talk will be made available online.


Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Supporting material: Supporting material
Project homepage or Git: Project homepage or Git
Your relationship with the presented work/project: Original author or co-author, Active contributor, Maintainer of the presented library/project

I am a research engineer at Inria, part of P16 and of the SODA research team. I am one of the maintainers of the skrub Python package. I hold a PhD in Computer Science and I am also interested in research on tabular learning and tabular foundation models.

Guillaume is an open-source software engineer working at :probabl. He is a core maintainer of the scikit-learn and imbalanced-learn libraries.
