Unlock the full predictive power of your multi-table data
2025-09-30, Gaston Berger

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer interaction logs, transaction records, or machine logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.


Machine learning projects rarely operate on clean, single-table datasets, especially in enterprise environments, where raw data is often distributed across multiple, interconnected tables. Think of customer interactions stored as logs, transactional records linked to user profiles, or manufacturing processes recorded through sequential machine logs. Yet, most ML workflows and tools still assume flattened inputs, forcing data scientists to manually transform and aggregate data in suboptimal ways that are often lossy, time-consuming, and prone to overfitting.
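To make the problem concrete, here is a minimal sketch of the manual flattening described above, using pandas and hypothetical table and column names (`customers`, `transactions`, `amount` are illustrative, not from any real dataset). A handful of hand-picked aggregates are joined back onto the main table, and everything not aggregated is simply lost:

```python
import pandas as pd

# Hypothetical raw multi-table data: one customer profile table and a
# one-to-many table of that customer's transactions.
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["A", "B"]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "amount": [10.0, 250.0, 30.0, 5.0],
})

# Typical manual flattening: compute a few fixed aggregates per customer
# and join them onto the main table. Any signal outside these chosen
# statistics is discarded before the model ever sees it.
agg = (
    transactions.groupby("customer_id")["amount"]
    .agg(n_tx="count", mean_amount="mean", max_amount="max")
    .reset_index()
)
flat = customers.merge(agg, on="customer_id", how="left")
print(flat)
```

Choosing which aggregates to compute is exactly the manual, error-prone step that the approach presented in this talk automates.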

In this talk, we address this gap by presenting a principled and scalable method for automatically constructing informative features directly from raw, multi-table data. Instead of relying on handcrafted pipelines or flattening heuristics, the method, implemented in a Python open-source library, leverages a model-driven strategy grounded in information theory (the Minimum Description Length principle) that guides feature construction through supervised algorithms. This results in interpretable, non-redundant features, with built-in regularization and no hyperparameter tuning required. The outputs are model-agnostic and naturally robust to overfitting.
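The intuition behind an MDL criterion can be sketched with a toy two-part code (this is a simplified illustration of the principle, not the actual criterion used by the library): a candidate feature transformation is kept only if the bits spent describing the model are repaid by a shorter encoding of the labels.

```python
import math

def code_length(labels):
    """Approximate cost, in bits, of encoding a label sequence with a
    Shannon code built on its empirical class distribution."""
    n = len(labels)
    return sum(labels.count(c) * math.log2(n / labels.count(c))
               for c in set(labels))

# Toy labels, ordered by some candidate numeric feature.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def mdl_split(labels, i, model_cost):
    """Two-part description length of splitting at position i:
    model cost (bits to state the split point) + data cost of each half."""
    return model_cost + code_length(labels[:i]) + code_length(labels[i:])

no_split = code_length(labels)  # cost of encoding labels with no model
best = min(mdl_split(labels, i, model_cost=math.log2(len(labels)))
           for i in range(1, len(labels)))

# The split is accepted only if it shortens the total description;
# on shuffled (uninformative) labels it typically would not, which is
# what makes MDL-style criteria resistant to overfitting.
print(no_split, best)
```

With these labels the best split pays for its model cost (total description drops from 10 bits to about 7.2), so it would be kept; a split that merely fits noise would not shorten the description and would be rejected without any hyperparameter tuning.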

This approach has been used in production for several years, particularly in domains like fraud detection, where patterns evolve quickly (concept drift), and models must be retrained frequently. Its ability to adapt rapidly to behavioral changes, provide audit-friendly explanations, and operate efficiently at scale has made it well-suited for industrial applications involving high data volumes and regulatory constraints.

Attendees should be familiar with basic supervised learning concepts (e.g. features, labels, overfitting). No prior knowledge of MDL or information theory is required; the necessary concepts will be introduced and motivated during the talk.

Outline and Time Breakdown:
- (0–5 min) Context – Why real-world data is relational, not flat
- (5–10 min) Challenges – Flattening, bias, leakage, scalability
- (10–15 min) Method – MDL framework, supervised aggregation, computational insights
- (15–25 min) Case study – Accident prediction with multi-table open data
- (25–30 min) Wrap-up – Takeaways, resources, and Q&A

Key Takeaways
- A rigorous, scalable methodology for generating supervised features from multi-table data
- Understanding of how and why the method avoids overfitting without manual tuning
- Ready-to-use Python code, available as a fully open-source library, to apply this approach in your own ML pipelines

I’m a machine learning specialist with a background in both research and industry. After completing a PhD in machine learning, I applied my expertise in industrial settings at Safran and Orange, focusing on anomaly prediction, defect detection, and fraud analysis.

Alexis Bondu is a Machine Learning researcher at Orange Research. His fields of research are varied and cover machine learning (AutoML), active learning, weakly supervised learning, time series, data streams, and early decision making. He is also responsible for the research part of the Khiops project, an AutoML solution developed in-house at Orange over the last twenty years and released as open source around two years ago. The aim of this research work is to prepare the new functionalities and algorithms that will appear in future versions of Khiops.