PyCon DE & PyData 2025

Unlocking the Predictive Power of Relational Data with Automated Feature Engineering
2025-04-24, Ferrum

Relational data can be a goldmine for classical machine learning applications, yet extracting useful features from multiple tables, time windows, and primary-key/foreign-key relationships is notoriously difficult. In this code tutorial, we’ll use the H&M Fashion dataset to demonstrate how getML FastProp automates feature engineering for both classification (churn prediction) and regression (sales prediction) with minimal manual effort, outperforming both Relational Deep Learning and a skilled human data scientist according to the RelBench leaderboard.

This code tutorial is perfect for data scientists looking to use their relational and time-series data effectively in any kind of predictive analytics application.


This tutorial tackles a common pain point in data science – extracting useful features from relational data spread across multiple interconnected tables. Manually crafting these features is often tedious, error-prone, and heavily reliant on domain expertise.
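To make the pain point concrete, here is a minimal sketch of what hand-crafting a single relational feature involves. It uses plain Python and two toy tables; the column names (`customer_id`, `t_dat`, `price`), the cutoff date, and the 30-day window are illustrative assumptions, not the actual H&M schema or the tutorial's code.

```python
from datetime import date, timedelta

# Hypothetical toy tables; names and values are illustrative only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "t_dat": date(2020, 9, 1), "price": 20.0},
    {"customer_id": 1, "t_dat": date(2020, 9, 10), "price": 5.0},
    {"customer_id": 1, "t_dat": date(2020, 6, 1), "price": 50.0},
    {"customer_id": 2, "t_dat": date(2020, 9, 20), "price": 10.0},
]


def window_features(customer_id, cutoff, days):
    """Count and total spend of one customer's purchases in [cutoff - days, cutoff)."""
    start = cutoff - timedelta(days=days)
    prices = [
        t["price"]
        for t in transactions
        # Join on the foreign key and keep only rows strictly before the
        # cutoff, so no information from the future leaks into the feature.
        if t["customer_id"] == customer_id and start <= t["t_dat"] < cutoff
    ]
    return {"count": len(prices), "total": sum(prices)}


cutoff = date(2020, 9, 15)
features = {
    c["customer_id"]: window_features(c["customer_id"], cutoff, days=30)
    for c in customers
}
```

Every additional aggregation (mean, max, distinct count), window length, or joined table multiplies the number of such functions to write and validate by hand, which is the work that automated feature engineering aims to take over.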

Why is this important? Relational data powers industries from e-commerce and healthcare to finance. Yet, building predictive models on such datasets often involves laborious feature engineering. getML FastProp – the fastest open-source algorithm for automated feature engineering – streamlines this process, helping data scientists move faster and build better models.

In this hands-on tutorial, we’ll work through two tasks from Stanford’s Relational Learning Benchmark (RelBench) using the H&M Fashion dataset: 1) predict customer churn with a classification model, and 2) forecast item sales with a regression model.
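As a rough illustration of how the two tasks differ, the sketch below builds a binary churn label and a numeric sales target from the same toy purchase table. The definitions used here are assumptions for illustration: churn means no purchase within a fixed horizon after a cutoff date, and sales means the summed price of an article's purchases in that horizon. The 7-day horizon, the column names, and the helper functions are hypothetical; consult the RelBench task definitions for the exact setup.

```python
from datetime import date, timedelta

# Hypothetical toy purchase table; names and values are illustrative only.
purchases = [
    {"customer_id": 1, "article_id": "A", "t_dat": date(2020, 9, 16), "price": 20.0},
    {"customer_id": 2, "article_id": "A", "t_dat": date(2020, 9, 30), "price": 10.0},
    {"customer_id": 1, "article_id": "B", "t_dat": date(2020, 9, 18), "price": 5.0},
]


def in_horizon(day, cutoff, horizon_days):
    """True if `day` falls in the prediction window [cutoff, cutoff + horizon)."""
    return cutoff <= day < cutoff + timedelta(days=horizon_days)


def churn_label(customer_id, cutoff, horizon_days=7):
    """Classification target: 1 if the customer buys nothing in the horizon."""
    active = any(
        p["customer_id"] == customer_id and in_horizon(p["t_dat"], cutoff, horizon_days)
        for p in purchases
    )
    return 0 if active else 1


def sales_target(article_id, cutoff, horizon_days=7):
    """Regression target: summed price of the article's purchases in the horizon."""
    return sum(
        p["price"]
        for p in purchases
        if p["article_id"] == article_id and in_horizon(p["t_dat"], cutoff, horizon_days)
    )


cutoff = date(2020, 9, 15)
churn = {cid: churn_label(cid, cutoff) for cid in (1, 2)}
sales = {aid: sales_target(aid, cutoff) for aid in ("A", "B")}
```

Both targets are computed only from rows at or after the cutoff, while features must come from rows before it; keeping that split strict is what makes the evaluation free of time leakage.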

We’ll walk through the code and concepts needed to solve these tasks with getML FastProp, achieving state-of-the-art performance and outperforming both Relational Deep Learning models and an experienced human data scientist.

By the end of this tutorial, you'll be able to:
- Understand relational learning – Grasp the core challenges and concepts of working with multi-table datasets.
- Reproduce results – Run the provided notebooks and code to reproduce the results at your own pace.
- Automate feature engineering – Use getML’s FastProp to extract features directly from relational data.
- Build and optimize getML pipelines – Develop pipelines for both classification and regression tasks.
- Integrate into MLOps workflows – Leverage getML alongside LightGBM and Optuna.

This tutorial provides a practical, reproducible framework for working with relational and time-series data, applicable across industries and domains.


Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Intermediate

Public link to supporting material, e.g. videos, Github, etc.:

https://github.com/getml/getml-relbench

Alexander Uhlig is the CEO of Code17, the company behind getML. With a background in Physics, he leads the development of getML and has worked hands-on with data teams to build prediction models across various domains, including healthcare, trading, and e-commerce.