PyCon DE & PyData 2026

Andreas Wurl

Lead Data Scientist at Merck Healthcare KGaA
Clinical Measurement Sciences, Biomarker development

See LinkedIn


Session

04-15
16:55
30min
Octopus AutoML: Extracting Signal from Small and High-Dimensional Data
Nils Haase, Andreas Wurl

Many machine learning tools assume abundant, independent data, rely on a single data split plus cross-validation, and leave test-set separation to the user.

In application-driven domains such as industrial materials science and pharmaceutical development, data are scarce, high-dimensional, and often correlated, creating conditions under which standard ML pipelines frequently fail. Small datasets are highly sensitive to the random seed used for splitting, and common pitfalls such as feature selection before splitting or distributing correlated samples across train and test sets cause data leakage and inflated performance metrics.
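The feature-selection pitfall can be reproduced in a few lines. The following is a minimal sketch using scikit-learn on synthetic, pure-noise data; the sample counts, the choice of `SelectKBest`, and the classifier are illustrative assumptions and are not taken from the talk or from Octopus. Because the labels are unrelated to the features, any accuracy clearly above chance in the first variant comes from leakage alone.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic small, high-dimensional data: 40 samples, 2000 noise features.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 2000
X = rng.normal(size=(n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)  # labels carry no real signal

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using ALL samples, including later test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# Clean: selection is refit inside each training fold via a pipeline.
clean_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
clean_acc = cross_val_score(clean_model, X, y, cv=cv).mean()

print(f"leaky accuracy: {leaky_acc:.2f}")
print(f"clean accuracy: {clean_acc:.2f}")
```

The same reasoning applies to correlated samples: if repeated measurements of one patient or one material batch can land on both sides of a split, a group-aware splitter such as scikit-learn's `GroupKFold` is one way to keep them together.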

Octopus is an open-source Python AutoML library explicitly designed for the small-data, high-dimensional regime. It enforces strict nested cross-validation for model and hyperparameter selection, quantifies performance variability across multiple splits, and guards tightly against data leakage. Its modular architecture embeds an internal ML engine, several feature selection methods (e.g., MRMR, Boruta), and external AutoML solutions such as AutoGluon into a unified, rigorous validation framework, enabling systematic and fair comparison of methods on limited data. In addition, Octopus supports survival analysis, addressing time-to-event problems common in healthcare and materials science. This talk will use realistic small-scale datasets to illustrate how conventional pipelines can be misleading and how to obtain more reliable models when every sample matters.
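To make the idea of nested cross-validation repeated over several splits concrete, here is a minimal sketch written with plain scikit-learn. It does not use or reproduce the Octopus API, which this abstract does not describe; the synthetic dataset, the logistic regression model, the `C` grid, and the helper name `nested_cv_score` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a small, high-dimensional dataset.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=5, random_state=0
)

def nested_cv_score(seed):
    """Mean outer-fold accuracy of one nested CV run for a given split seed."""
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Inner loop: hyperparameter selection on training folds only.
    search = GridSearchCV(
        LogisticRegression(max_iter=2000),
        param_grid={"C": [0.01, 0.1, 1.0]},
        cv=inner,
    )
    # Outer loop: performance estimate on data never seen by the search.
    return cross_val_score(search, X, y, cv=outer).mean()

# Repeat with different split seeds to expose split-to-split variability.
scores = np.array([nested_cv_score(seed) for seed in range(5)])
print(f"accuracy: mean={scores.mean():.2f}, std={scores.std():.2f}")
print("per-seed scores:", np.round(scores, 2))
```

Reporting the spread across seeds alongside the mean shows how much of an apparent result depends on a single lucky or unlucky split.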

PyData: Machine Learning & Deep Learning & Statistics
Europium [3rd Floor]