EuroSciPy 2024

09:00
09:00
90min
Introduction to Python
Mojdeh Rastgoo

This tutorial will provide an introduction to Python intended for beginners.

It will notably introduce the following aspects:

  • built-in types
  • control flow (conditions, loops, etc.)
  • built-in functions
  • basic Python classes
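The four bullet points above fit in a short, self-contained sketch (the names and values are invented for illustration, not the tutorial's actual material):

```python
# Built-in types
numbers = [3, 1, 4, 1, 5]          # list
ages = {"ada": 36, "alan": 41}     # dict

# Control flow: a loop with a condition
total = 0
for n in numbers:
    if n % 2 == 1:                 # keep odd numbers only
        total += n

# Built-in functions
print(len(numbers), max(numbers), sorted(numbers))

# A basic Python class
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"

print(Greeter("EuroSciPy").greet())
```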
Community, Education, and Outreach
Room 6
09:00
90min
What is the magic of magic methods in the Python language?
Paweł Żal

Welcome to this tutorial on Python's magic methods, often underestimated or overlooked in programming practice. And yet, magic methods make it easy to write readable code that implements even very complex algorithms.

In this tutorial, we will learn how to create and use magic methods and talk about some interesting non-trivial magic methods.
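As a taste of what such methods look like, here is a small illustrative class (this `Vector` example is ours, not the speaker's) implementing a few common magic methods:

```python
class Vector:
    """A tiny 2-D vector demonstrating a few common magic methods."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):                      # controls how the object prints
        return f"Vector({self.x}, {self.y})"

    def __add__(self, other):                # enables v + w
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):                 # enables v == w
        return (self.x, self.y) == (other.x, other.y)

    def __abs__(self):                       # enables abs(v)
        return (self.x ** 2 + self.y ** 2) ** 0.5

v = Vector(1, 2) + Vector(2, 2)
print(v, abs(v))
```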

This tutorial will include classroom exercises, post-class homework, and complementary readings. All presentation materials, code, and exercises will be shared in advance (~ 2 - 3 days), and the solutions to the exercises will be shared after the tutorial is completed.

Community, Education, and Outreach
Room 5
10:30
10:30
30min
Break
Room 6
10:30
30min
Break
Room 5
11:00
11:00
90min
Decorators - A Deep Dive
Mike Müller

Python offers decorators to implement reusable code for cross-cutting tasks.
They support the separation of cross-cutting concerns such as logging, caching,
or permission checks.
This can improve code modularity and maintainability.

This tutorial is an in-depth introduction to decorators.
It covers the usage of decorators and how to implement simple and more advanced
decorators.
Use cases demonstrate how to work with decorators.
In addition to showing how functions can use closures to create decorators,
the tutorial introduces callable class instances as an alternative.
Class decorators can solve problems that used to be tasks for metaclasses.
The tutorial provides use cases for class decorators.

While the focus is on best practices and practical applications, the tutorial
also provides deeper insight into how Python works behind the scenes.
After the tutorial, participants will feel comfortable with functions that take
functions and return new functions.
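For a flavor of the material, here is a minimal sketch of the two styles mentioned above, a closure-based decorator and a callable class instance (these examples are illustrative, not the tutorial's actual code):

```python
import functools

# A decorator built from a closure
def log_calls(func):
    @functools.wraps(func)          # preserve the wrapped function's name/docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}{args}")
        return func(*args, **kwargs)
    return wrapper

# The same idea as a callable class instance, which can also keep state
class CountCalls:
    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@log_calls
def add(a, b):
    return a + b

@CountCalls
def square(x):
    return x * x

print(add(2, 3))
print(square(4), square.count)
```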

Scientific Applications
Room 5
11:00
90min
Introduction to NumPy
Sarah Diot-Girard

Are you starting to use Python for scientific computing? Join this tutorial to learn more about NumPy, the building block for nearly all libraries in the scientific ecosystem.
You will learn how to manipulate NumPy arrays, understand how they store data, and discover how to get optimal performance. By the end of this tutorial, you will be able to start working with NumPy and know the main pitfalls to avoid.
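For a taste of what the tutorial covers, a short illustrative sketch of array manipulation, including the classic view-versus-copy pitfall (the numbers are arbitrary):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)        # a 3x4 array of 0..11

# Vectorized arithmetic: no Python loop needed
doubled = a * 2

# Slicing returns a *view*, not a copy -- a classic pitfall
row = a[0]
row[0] = 99
print(a[0, 0])        # the original array changed too

# Reductions along an axis
print(a.sum(axis=0))  # column sums
```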

Data Science and Visualisation
Room 6
12:30
12:30
90min
Lunch
Room 6
12:30
90min
Lunch
Room 5
14:00
14:00
90min
Introduction to matplotlib for Data Visualization with Python
Nefta Kanilmaz

matplotlib is a library for creating visualizations with Python which "...makes easy things easy and hard things possible" (https://matplotlib.org/). This tutorial, intended for beginners, will introduce the library and explain core concepts as well as the main interfaces. Starting with styling simple point data plots, we will explain how to work with several dimensions, shared axes and advanced styling options using rcParams. After completing this tutorial, participants will hopefully be equipped with a thorough understanding of matplotlib to navigate the "hard things" in the world of data visualization.
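As a small illustration of the object-oriented interface and shared axes mentioned above (the data and styling choices are ours, not the tutorial's):

```python
import matplotlib
matplotlib.use("Agg")          # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One Figure, two Axes sharing the x axis
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(x, np.sin(x), label="sin")
ax2.plot(x, np.cos(x), color="tab:orange", label="cos")
ax1.legend()
ax2.legend()
ax2.set_xlabel("x")
fig.savefig("waves.png")
```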

Data Science and Visualisation
Room 6
14:00
90min
Probabilistic classification and cost-sensitive learning with scikit-learn
Guillaume Lemaitre, Olivier Grisel

Data scientists are repeatedly told that it is absolutely critical to align their model training methodology with a specific business objective. While this is good advice, it usually falls short on the details of how to achieve it in practice.

This hands-on tutorial aims to introduce helpful theoretical concepts and concrete software tools to help practitioners bridge this gap. The approach will be illustrated on a worked practical use case: optimizing the operations of a fraud detection system for a payment processing platform.

More specifically, we will introduce the concepts of calibrated probabilistic classifiers, how to evaluate them and fix common causes of mis-calibration. In a second part, we will explore how to turn probabilistic classifiers into optimal business decision makers.
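A minimal sketch of the two steps described above, calibrating a classifier and turning its probabilities into decisions with an assumed cost matrix (the dataset, model choice, and cost numbers below are invented; see the linked material for the actual tutorial code):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the model so its predicted probabilities are calibrated
clf = CalibratedClassifierCV(RandomForestClassifier(random_state=0), method="isotonic")
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Turn probabilities into decisions with an (assumed) cost matrix:
# blocking a fraudulent payment gains 90, blocking a legit one costs 10.
gain_true_positive, cost_false_positive = 90.0, 10.0
threshold = cost_false_positive / (cost_false_positive + gain_true_positive)
flag = proba >= threshold
print(f"threshold={threshold:.2f}, flagged={flag.sum()} of {len(flag)}")
```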

The tutorial material is available at the following URL: https://github.com/probabl-ai/calibration-cost-sensitive-learning

Machine and Deep Learning
Room 5
15:30
15:30
30min
Break
Room 6
15:30
30min
Break
Room 5
16:00
16:00
90min
Image analysis in Python with scikit-image
Lars Grüter, Marianne Corvellec, Stéfan van der Walt

Scientists are producing more and more images with telescopes, microscopes, MRI scanners, etc. They need automatable tools to measure what they've imaged and help them turn these images into knowledge. This tutorial covers the fundamentals of algorithmic image analysis, starting with how to think of images as NumPy arrays, moving on to basic image filtering, and finishing with a complete workflow: segmenting a 3D image into regions and making measurements on those regions.
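The workflow described above can be sketched on a tiny synthetic image (this example is ours, not the tutorial's): threshold, label connected regions, then measure them.

```python
import numpy as np
from skimage import filters, measure

# An "image" with two bright blobs on a dark background
image = np.zeros((64, 64))
image[10:20, 10:20] = 1.0
image[40:55, 40:55] = 1.0
image = filters.gaussian(image, sigma=1)      # smooth, as real data would be

threshold = filters.threshold_otsu(image)     # automatic threshold
labels = measure.label(image > threshold)     # label connected regions

regions = measure.regionprops(labels)
print(len(regions), [r.area for r in regions])
```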

Data Science and Visualisation
Room 6
16:00
90min
Using the Array API to write code that runs with NumPy, CuPy and PyTorch
Tim Head, Sebastian Berg

Python code that works with NumPy, CuPy, and PyTorch arrays? Code that uses a GPU when possible but falls back to the CPU if there is none? We will show you how to write Python code that does all of the above. The not-so-secret ingredient is the Array API. In this workshop you will learn what the Array API is and how to use it to write programs that can take any compatible array as input.

High Performance Computing
Room 5
09:00
09:00
90min
Building robust workflows with strong provenance
Alexander Goscinski, Julian Geiger, Ali Khosravi

In computational science, different software packages are often glued together as scripts to perform numerical experiments. With increasing complexity, these scripts become unmaintainable, prone to crashes, and hard to scale up and collaborate on. AiiDA solves these problems with a powerful workflow engine and by keeping provenance for the entire workflow. In this tutorial, we learn how to create dynamic workflows that combine different executables, can automatically restart from failed runs, and reuse results from completed calculations via caching.

Scientific Applications
Room 5
09:00
90min
Introduction to Polars: Fast and Readable Data Analysis
Geir Arne Hjelle

Polars is a new, powerful library for analyzing structured data. The library focuses on processing speed and a consistent, intuitive API. This tutorial will help you get started with Polars by showing you how to read and write data and manipulate it with Polars' powerful expression syntax. You'll learn how the lazy API is an important key to Polars' efficiency.

Data Science and Visualisation
Room 6
10:30
10:30
30min
Break
Room 6
10:30
30min
Break
Room 5
11:00
11:00
90min
Combining Python and Rust to create Polars Plugins
Marco Gorelli

Polars is a dataframe library taking the world by storm. It is very runtime and memory efficient and comes with a clean and expressive API. Sometimes, however, the built-in API isn't enough. And that's where its killer feature comes in: plugins. You can extend Polars, and solve practically any problem.

No prior Rust experience is required; intermediate Python or general programming experience is expected. By the end of the session, you will know how to write your own Polars plugin! This talk is aimed at data practitioners.

Data Science and Visualisation
Room 5
11:00
90min
Using Wikipedia as a language corpus for NLP
Jakub B. Jagiełło

Learning NLP often requires a corpus of sample texts, and a common choice is Wikipedia. The project is open source and has huge amounts of natural-language content in dozens of languages. Helpfully, the Wikimedia Foundation publishes data dumps in XML format, which can be easily parsed. In this tutorial you will learn how to do that in Python.
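As a sketch of the idea, the standard library's ElementTree can stream-parse page elements from a tiny made-up excerpt (real dumps use the same page/title/text layout, wrapped in a MediaWiki XML namespace):

```python
import io
import xml.etree.ElementTree as ET

# A miniature, invented stand-in for a Wikipedia dump file
dump = io.BytesIO(b"""<mediawiki>
  <page>
    <title>Python (programming language)</title>
    <revision><text>Python is a programming language...</text></revision>
  </page>
  <page>
    <title>NumPy</title>
    <revision><text>NumPy is a library for arrays...</text></revision>
  </page>
</mediawiki>""")

# iterparse streams the file, so multi-gigabyte dumps fit in memory
articles = {}
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        title = elem.find("title").text
        text = elem.find("revision/text").text
        articles[title] = text
        elem.clear()                 # free memory as we go

print(list(articles))
```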

Scientific Applications
Room 6
12:30
12:30
90min
Lunch
Room 6
12:30
90min
Lunch
Room 5
14:00
14:00
90min
Introduction to Machine Learning with scikit-learn and Pandas
Justyna Szydłowska-Samsel

With Machine Learning becoming a topic of high interest in the scientific community, many different programming languages and environments have been used for Machine Learning research and system development over the years. Python is known as an easy-to-learn yet powerful programming language and has become a popular choice among professionals and amateurs alike. This tutorial will provide instructions on the usage of two popular Python libraries, scikit-learn and Pandas, in Machine Learning modeling.
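A minimal sketch of the kind of workflow the tutorial covers, a pandas DataFrame with mixed column types fed into a scikit-learn pipeline (the data and model choice are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["Basel", "Paris", "Basel", "Berlin", "Paris", "Berlin"],
    "bought": [0, 1, 0, 1, 1, 0],
})

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[["age", "city"]], df["bought"])
print(model.predict(df[["age", "city"]]))
```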

Machine and Deep Learning
Room 6
14:00
90min
Multi-dimensional arrays with Scipp
Mridul Seth

Inspired by xarray, Scipp enriches raw NumPy-like multi-dimensional data arrays by adding named dimensions and associated coordinates. For an even more intuitive and less error-prone user experience, Scipp adds physical units to arrays and their coordinates. Through this tutorial, participants will learn the basics of modelling their data with the Scipp library and using Scipp's built-in tools for scientific data analysis.

One of Scipp's key features is the possibility of using multi-dimensional non-destructive binning to sort record-based "tabular"/"event" data into arrays of bins. This provides fast and flexible binning, rebinning, and filtering operations, all while preserving the original individual records.

Scipp ships with data display and visualization features for Jupyter notebooks, including a powerful plotting interface. Named Plopp, this tool uses a graph of connected nodes to provide interactivity between multiple plots and widgets, requiring only a few lines of code from the user.

Scipp is available via pip and conda and runs on Linux, Mac and Windows.

High Performance Computing
Room 5
15:30
15:30
30min
Break
Room 6
15:30
30min
Break
Room 5
16:00
16:00
90min
A Hitchhiker's Guide to Contributing to Open Source
Sebastian Berg, Nikoleta E. Glynatsi

Open-source projects are essential for scientific programming. They provide many tools and resources that can be customized for different scientific needs. However, sometimes the existing tools in a package don't meet all the requirements of a project. This is when contributing to open-source packages becomes important. By contributing, you can implement new functionalities, improve the software and help keep the open-source community strong.

This workshop will make contributing to open-source projects easier to understand. It will guide participants from just using the software to actively contributing to it. The workshop will address technical challenges such as interacting with web-based hosting services (like GitHub and GitLab), branching, and opening pull requests. Additionally, it will cover how to contribute documentation and ensure the correctness of the code.

We will use the following repository during the workshop: https://github.com/Nikoleta-v3/HitchCos.

You can find a checklist of prerequisites and installation notes here: https://github.com/Nikoleta-v3/HitchCos/wiki/Prerequisites.

Community, Education, and Outreach
Room 6
16:00
90min
sktime - python toolbox for time series – introduction and new features 2024: foundation models, deep learning backends, probabilistic models, hierarchical demand forecasting, marketplace features
Franz Kiraly, Felipe Angelim, Muhammad Armaghan Shakir, Benedikt Heidrich

sktime is the most widely used scikit-learn-compatible framework library for learning with time series. sktime is maintained by a neutral non-profit under a permissive license, easily extensible by anyone, and interoperable with the Python data science stack.

This tutorial gives a hands-on introduction to sktime, for common time series learning tasks such as forecasting, starting with a general overview of the package and forecasting interfaces for uni- and multivariate forecasts with endo-/exogenous data, probabilistic forecasts, and forecasting in the presence of hierarchical data.

The tutorial then proceeds to showcase some of the newest features in 2024, based on a hierarchical demand forecasting use case example: support for foundation models, hugging face connectors, advanced support for hierarchical and global forecasts, and integration features for creating API compatible algorithms and sharing them via the sktime discoverability tools.

Machine and Deep Learning
Room 5
09:00
09:00
60min
10 Years of Open Source: Navigating the Next AI Revolution
Ines Montani

A lot has been happening in the field of AI and Natural Language Processing: there's endless excitement about new technologies, sobering post-hype hangovers, and also uncertainty about where the field is heading next. In this talk, I'll share the most important lessons we've learned in 10 years of working on open source software, our core philosophies that helped us adapt to an ever-changing AI landscape, and why open source and interoperability still win over black-box, proprietary APIs.

Community, Education, and Outreach
Room 7
10:00
10:00
30min
Coffee Break
Room 7
10:00
30min
Coffee Break
Room 6
10:30
10:30
30min
Federated Learning: Where we are and where we need to be
Katharine Jarmul

In this talk, we'll review the landscape of open-source federated learning libraries with a lens on actual real world data problems, use cases and actors who could benefit from federated learning. We'll then analyze gaps, weaknesses and explore new ways we could formulate federated learning problems (and their associated libraries!) to build more useful software and use decentralized machine learning in real world use cases.

Machine and Deep Learning
Room 7
10:30
30min
From stringly typed to strongly typed: Insights from re-designing a library to get the most out of type hints
Janos Gabler

Many scientific Python packages are "stringly typed": they use strings to select algorithms or methods and dictionaries for configuration. While easy for beginners and convenient for authors, these libraries miss out on static typing benefits like error detection before runtime and autocomplete. This talk shares insights from redesigning the optimagic library from the ground up with static typing in mind. Without compromising on simplicity, we achieve better static analysis, autocomplete, and fewer runtime errors. The insights are not specific to numerical optimization and apply to a wide range of scientific Python packages.
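To illustrate the contrast (this is our sketch, not optimagic's actual API), compare selecting an algorithm by string with selecting it via a typed configuration object:

```python
from dataclasses import dataclass

# "Stringly typed": a typo like algorithm="newtn" only fails at runtime,
# and nothing documents which keys the options dict accepts.
def minimize_stringly(algorithm: str, options: dict):
    ...

# Strongly typed: the type checker and autocomplete know the valid
# algorithms and each algorithm's options.
@dataclass
class Newton:
    max_iterations: int = 100
    tolerance: float = 1e-8

@dataclass
class NelderMead:
    max_iterations: int = 200

def minimize_typed(algorithm: "Newton | NelderMead"):
    ...

minimize_typed(Newton(tolerance=1e-6))   # checked before runtime
```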

Community, Education, and Outreach
Room 6
10:30
45min
OpenGL is dying, let's talk about WebGPU
Almar Klein

OpenGL is old and on a path to deprecation. Modern GPU APIs like Vulkan and Metal solve most of the problems that plague OpenGL, and higher abstractions like wgpu / WebGPU provide a modern interface to control GPU hardware. These APIs are much more pleasant to work with and also provide performance benefits, especially for Python.

Data Science and Visualisation
Room 5
11:05
11:05
30min
Helmholtz Blablador and the LLM models' ecosystem
Alexandre Strube

Helmholtz Blablador is the LLM inference server of the Helmholtz Foundation. This talk explores Blablador's role in hosting open-source LLM models and models developed in-house at the Juelich Supercomputing Centre (JSC), and surveys the wider open-source LLM ecosystem.

Machine and Deep Learning
Room 7
11:05
30min
Understanding NetworkX's API Dispatching with a parallel backend
Erik Welch, Aditi Juneja

Hi! Have you ever wished your pure Python libraries were faster? Or wanted to fundamentally improve a Python library by rewriting everything in a faster language like C or Rust? Well, wish no more... NetworkX's backend dispatching mechanism redirects your plain old NetworkX function calls to a FASTER implementation present in a separate backend package by leveraging Python's entry_point specification!

NetworkX is a popular, pure Python library used for graph (aka network) analysis. But when the graph size increases (like a network of everyone in the world), NetworkX algorithms can take days to solve a simple graph analysis problem. So, to address these performance issues, this backend dispatching mechanism was recently developed. In this talk, we will unveil this dispatching mechanism and its implementation details, and how we can use it just by specifying a backend kwarg like this:

>>> nx.betweenness_centrality(G, backend="parallel")

or by passing the backend graph object (type-based dispatching):

>>> H = nxp.ParallelGraph(G)
>>> nx.betweenness_centrality(H)

We'll also go over the limitations of this dispatch mechanism. Then we'll use the example of nx-parallel as a guide to building our own custom NetworkX backend, and test the backend we build using NetworkX's existing test suite, ending with a quick dive into the details of the nx-parallel backend.

Community, Education, and Outreach
Room 6
11:15
11:15
45min
Scientific Python
Jarrod Millman, Stéfan van der Walt

Learn more about the Scientific Python project (https://scientific.python.org): what it aims to achieve (helping the developer community), recent progress that has been made, and how to become involved.

Community, Education, and Outreach
Room 5
11:40
11:40
20min
Data augmentation with Scikit-LLM
Claudio Giorgio Giancaterino

Scikit-LLM is an innovative Python library that seamlessly integrates Large Language Models into the scikit-learn framework. Scikit-LLM becomes a powerful tool for natural language processing (NLP) tasks within the scikit-learn pipeline, and I'll showcase a data augmentation workflow that builds features using zero-shot text classification and text vectorization.

Machine and Deep Learning
Room 7
11:40
20min
Enhancing Bayesian Optimization with Ensemble Models for Categorical Domains
Ilya Komarov

Bayesian optimization is a powerful technique for optimizing black-box, costly-to-evaluate functions, widely applicable across diverse fields. However, Gaussian process (GP) models commonly used in Bayesian optimization struggle with functions defined on categorical or mixed domains, limiting optimization in scenarios with numerous categorical inputs. In this talk, we present a solution by leveraging ensemble models for probabilistic modelling, providing a robust approach to optimize functions with categorical inputs. We showcase the effectiveness of our method through a Bayesian optimization setup implemented with the BoTorch library, utilizing probabilistic models from the XGBoostLSS framework. By integrating these tools, we achieve efficient optimization on domains with categorical variables, unlocking new possibilities for optimization in practical applications.

Data Science and Visualisation
Room 6
12:00
12:00
80min
Lunch Break
Room 7
12:00
80min
Lunch Break
Room 6
13:20
13:20
30min
Skrub: prepping tables for machine learning
Guillaume Lemaitre, Vincent Maladiere, Jérôme Dockès

When it comes to designing machine learning predictive models, it is reported that data scientists spend over 80% of their time preparing the data to input to the machine learning algorithm.

Currently, no automated solution exists to address this problem. However, the skrub Python library is here to alleviate some of the daily tasks of data scientists and offer an integration with the scikit-learn machine learning library.

In this talk, we provide an overview of the features available in skrub.

First, we focus on the preprocessing stage closest to the data sources. While predictive models usually expect a single design matrix and a target vector (or matrix), in practice, it is common that data are available from different data tables. It is also possible that the data to be merged are slightly different, making it difficult to join them. We will present the skrub joiners that handle such use cases and are fully compatible with scikit-learn and its pipeline.

Then, another issue widely tackled by data scientists is dealing with heterogeneous data types (e.g., dates, categorical, numerical). We will present the TableVectorizer, a preprocessor that automatically handles different types of encoding and transformation, reducing the amount of boilerplate code to write when designing predictive models with scikit-learn. Like the joiner, this transformer is fully compatible with scikit-learn.

Machine and Deep Learning
Room 7
13:20
30min
The joys and pains of reproducing research: An experiment in bioimaging data analysis
Marianne Corvellec

The conversation about reproducibility is usually focused on how to make research workflows (more) reproducible. Here, we consider it from the opposite perspective and ask: How feasible is it, in practice, to reproduce research which is meant to be reproducible? Is it even done or attempted? We provide a detailed account of such an attempt, trying to reproduce some segmentation results for 3D microscopy images of a developing mouse embryo. The original research is a monumental work of bioimaging and analysis at the single-cell level, published in Cell in 2018, alongside all the necessary research artifacts. Did we succeed in this attempt? As we share the joys and pains of this journey, many questions arise: How exactly do reviewers assess reproducibility claims? Incentivizing reproducible research is still an open problem, since it is so much more costly (in time) to produce. And how can we incentivize those who test reproducibility? Not only is it costly to set up computational environments and execute data-intensive scientific workflows, but it may not appear rewarding at first glance. In addition, there is a human factor: It is thorny to show authors that their publication does not hold up to their reproducibility claims.

Data Science and Visualisation
Room 6
13:55
13:55
20min
From data analysis in Jupyter Notebooks to production applications: AI infrastructure at reasonable scale
Frank Sauerburger

The availability of AI models and packages in the Python ecosystem has revolutionized many applications across domains. This talk discusses infrastructural decisions and best practices that bridge the gap between interactive data analyses in notebooks and production applications at a reasonable scale, suitable for both commercial and scientific contexts. In particular, the talk introduces the on-premises, Python-based AI architecture employed at MDPI, one of the largest open-access publishers. The presentation emphasizes the impact of the design on reproducibility, decoupling of different resources, and ease of use during the development and exploration phases.

Machine and Deep Learning
Room 7
13:55
30min
Mostly Harmless Fixed Effects Regression in Python with PyFixest
Alexander Fischer

This session introduces PyFixest, an open-source Python library inspired by the "fixest" R package. PyFixest implements fast routines for the estimation of regression models with high-dimensional fixed effects, including OLS, IV, and Poisson regression. The library also provides tools for robust inference, including heteroscedasticity-robust and cluster-robust standard errors, as well as the wild cluster bootstrap and randomization inference. Additionally, PyFixest implements several routines for difference-in-differences estimation with staggered treatment adoption.

PyFixest aims to faithfully replicate the core design principles of "fixest", offering post-estimation inference adjustments, user-friendly syntax for multiple estimations, and efficient post-processing capabilities. By making efficient use of jit-compilation, it is also one of the fastest solutions for regressions with high-dimensional fixed effects.

The presentation will argue why there is a need for another regression package in Python, cover PyFixest's functionality and design philosophy, and discuss future development prospects.

Data Science and Visualisation
Room 6
13:55
50min
[CHANGE OF PROGRAM] Informal discussions about switching build backends
Ralf Gommers

Goals:

  • Share tips, tricks and best practices for configuring the build backend of a Python package with compiled (Cython/C/C++/Rust/Fortran) code
  • Identify shared needs between packages, and discuss gaps in current build backends, documentation, or shared infrastructure

Topics:

  • Goals to aim for in your build config (and how to achieve them):
      • Faster builds and relevant tooling like profiling
      • Build logs that actually help when diagnosing issues
      • How to debug build failures effectively
      • How to check for and visualize build dependencies
      • Ensuring builds are reproducible
      • Approaches to reducing binary size
      • CI config ideas to guard against regressions
  • Recent build-related developments & a post-distutils world
  • What are the most pressing pain points for maintainers?
Scientific Applications
Room 5
14:25
14:25
20min
A Qdrant and Specter2 framework for tracking resubmissions of rejected manuscripts in academia
Daniele Raimondi

This presentation introduces a framework based on the Qdrant vector DB and the Specter2 model, used to identify whether a rejected academic manuscript is later published in a competing journal. Our method combines AI, data science, and analytics to ensure reliable identification of manuscripts and authors. The findings offer insights into resubmission patterns, enhancing our understanding of academic publishing dynamics. The system is implemented in Python.

Data Science and Visualisation
Room 7
14:25
20min
Conformal Prediction with MAPIE: A Journey into Reliable Uncertainty Quantification
Claudio Giorgio Giancaterino

In the ever-evolving landscape of data science, accurate uncertainty quantification is crucial for decision-making processes. Conformal Prediction (CP) stands out as a powerful framework for addressing this challenge by providing reliable uncertainty estimates alongside predictions. In this talk, I'll delve into the world of Conformal Prediction, with a focus on the MAPIE Python library, offering a comprehensive understanding of its advantages and practical applications.

Data Science and Visualisation
Room 6
15:00
15:00
30min
Coffee Break
Room 7
15:00
30min
Coffee Break
Room 6
15:30
15:30
60min
Poster Spotlight+Lightning Session
Room 7
16:30
16:30
90min
Poster Session
Room 7
09:00
09:00
60min
Just contribute?!
Wolf Vollprecht

Open source software is here for everyone - but how are we making sure that everyone has equal access?
In this keynote I will discuss how to lower barriers of entry for new contributors - and the many facets to this: documentation, community, guidelines, and tools.
I will share my personal motivations for contributing to open-source software, my journey over the past five years, and the lessons learned along the way.

Community, Education, and Outreach
Room 7
10:00
10:00
30min
Coffee Break
Room 7
10:00
30min
Coffee Break
Room 6
10:30
10:30
30min
Optimagic: Can we unify Python's numerical optimization ecosystem?
Janos Gabler

Python has many high-quality optimization algorithms, but they are scattered across many different packages. Switching between packages is cumbersome and time-consuming. Other languages are ahead of Python in this respect. For example, Optimization.jl provides a unified interface to more than 100 optimization algorithms and is widely accepted as a standard interface for optimization in Julia.

In this talk, we take stock of the existing optimization ecosystem in Python and analyze pain points and reasons why no single package has emerged as a standard so far. We use these findings to derive desirable features a Python optimization package would need to unify the ecosystem.

We then present optimagic, a NumFOCUS-affiliated project with the goal of unifying the Python optimization ecosystem. Optimagic provides a common interface to optimization algorithms from SciPy, NLopt, pygmo, and many other libraries. The minimize function feels familiar to users of scipy.optimize who are looking for a more extensive set of supported optimizers. Advanced users can use optional arguments to configure every aspect of the optimization, create a persistent log file, turn local optimizers global with a multistart framework, and more.

Finally, we discuss an ambitious roadmap for improvements, new features, and planned community activities for optimagic.

Community, Education, and Outreach
Room 7
10:30
30min
The Parallel Universe in Python - A Time Travel to Python 3.13 and beyond
Mike Müller

Parallel computing is essential for many performance-critical applications. Python provides many solutions for this problem. New versions of Python will support sub-interpreters and a currently experimental free-threaded build without the Global Interpreter Lock (GIL).

This talk starts with a short overview of the topic, clarifying terms such as parallel, concurrent, and distributed computing as well as CPU-bound, memory-bound, and IO-bound problems. The presentation explains how Python and its standard library support parallel programming tasks. In addition, many Python libraries provide very useful approaches and tools for parallel computing. An overview of important libraries provides guidance on which library can be used for which type of parallel problem.

How do Python's new features such as sub-interpreters and free-threading without the Global Interpreter Lock (GIL) affect parallel programming in Python? This talk addresses this question with examples where these features might help to make programs simpler and/or faster.
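As a small illustration of the standard library's side of this story, a thread pool overlapping IO-bound tasks (the workload is a toy; for CPU-bound work one would swap in ProcessPoolExecutor to sidestep the GIL, or try the free-threaded build):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_double(n):
    time.sleep(0.1)          # stands in for waiting on disk or network
    return 2 * n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_double, [1, 2, 3, 4]))
elapsed = time.perf_counter() - start

print(results)               # the four 0.1s sleeps ran concurrently
print(f"{elapsed:.2f}s")
```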

High Performance Computing
Room 6
11:00
11:00
30min
LPython: Novel, Fast, Retargetable Python Compiler
Naman Gera

Python is one of the most used languages today, known for its simplicity and versatile ecosystem. For performance-critical applications such as High Performance Computing (HPC) or other kinds of numerical computing, the standard CPython implementation is often not fast enough. To address this, enter the fascinating world of LPython, a Python compiler designed to give you the best possible performance for numerical, array-oriented code, which can also generate code using multiple backends such as LLVM, C, C++, and WASM.

High Performance Computing
Room 6
11:00
45min
NumPy's new DType API and 2.0 transition
Sebastian Berg

NumPy 2 had some significant changes in its API and required many downstream libraries and users to adapt.
One of the larger new features is that the new DType API is now public. This C-API allows more powerful user defined DTypes, for which the new StringDType is an example. In the first part, I will give a brief overview of this API.

Since many downstream projects needed to adapt and publish new versions, in the second part I recap the current and past difficulties in transitioning to NumPy 2. This part of the session will be a forum for open discussion to gauge the challenges faced by users in making this transition.

Scientific Applications
Room 5
11:00
30min
forecasting foundation models: evaluation and integration with sktime – challenges and outcomes
Franz Kiraly, Benedikt Heidrich

Foundation models are here for forecasting! This will conclusively solve all forecasting problems with a one-model-fits-all approach! Or … maybe not?

The fact is, an ever-growing number of foundation models for time series and forecasting are hitting the market.

To innocent end users, this situation raises various challenges and questions. How do I integrate the models as candidates into existing forecasting workflows? Are the models performant? How do they compare to more classical choices? Which one to pick? How to know whether to “upgrade”?

At sktime, we have tried so you don't have to! Though you will probably be forced to anyway; even then, it's worth sharing experiences.

Our key challenges and findings are presented in this talk – for instance, the unexpected fragmentation of the ecosystem, difficulties in evaluating the models fairly, and more.

(sktime is an openly governed community with a neutral point of view. You may be surprised to hear that this talk will not try to sell you a foundation model.)

Machine and Deep Learning
Room 7
11:30
11:30
20min
The Array API Standard in SciPy
Lucas Colley

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask. Find out how we are using it in SciPy to bring support for hardware-accelerated (e.g. GPU) and distributed arrays to our users, and how you can do the same in your library.
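The pattern the standard enables can be sketched with toy classes (a hedged illustration, not SciPy's actual code): a library function retrieves the array namespace from its input via `__array_namespace__` and performs all operations through it, so the same code runs unchanged on any conforming array library.

```python
import math

class ToyArray:
    """Hypothetical stand-in for an array-API-conforming array."""
    def __init__(self, data):
        self.data = list(data)

    def __mul__(self, other):
        return ToyArray(a * b for a, b in zip(self.data, other.data))

    def __array_namespace__(self, api_version=None):
        # Real libraries return their array API namespace here
        # (e.g. numpy, cupy, torch); we return a toy one.
        return toy_namespace

class _ToyNamespace:
    """Hypothetical namespace exposing two array API functions
    (simplified: the reduction returns a plain float, not a 0-d array)."""
    def sum(self, x):
        return sum(x.data)

    def sqrt(self, x):
        return math.sqrt(x)

toy_namespace = _ToyNamespace()

def euclidean_norm(x):
    # Library code never imports a specific array library: it asks the
    # input array for its namespace and dispatches through it.
    xp = x.__array_namespace__()
    return xp.sqrt(xp.sum(x * x))

print(euclidean_norm(ToyArray([3.0, 4.0])))  # → 5.0
```

With real array-API-conforming inputs, the same `euclidean_norm` would run on CPU NumPy arrays, GPU CuPy arrays, or distributed Dask arrays without modification.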

High Performance Computing
Room 6
11:30
20min
The Mission Support System and its use in planning an aircraft campaign
Reimar Bauer

The Mission Support System (MSS) is an open-source software package that has been used for planning flight tracks of scientific aircraft in multiple measurement campaigns during the last decade. It consists of several components: a data-retrieval tool chain; a WMS server, which creates 2-D figures from 4-D meteorological data; a client application for displaying the figures in combination with the planned flight track and other data; and a collaboration server used for data exchange between participants. The talk describes how we used these components for a campaign.

Scientific Applications
Room 7
12:00
12:00
80min
Lunch Break
Room 7
12:00
80min
Lunch Break
Room 6
13:20
13:20
30min
Accelerating Python on HPC with Dask
Jacob Tomlinson

Dask is a popular Python framework for scaling your workloads, whether you want to leverage all of the cores on your laptop and stream large datasets through memory, or scale your workload out to thousands of cores on large compute clusters. Dask allows you to distribute code using familiar APIs such as pandas, NumPy and scikit-learn or write your own distributed code with powerful parallel task-based programming primitives.

In this session we will dive into the many ways to deploy Dask workloads on HPC, and how to choose the right method for your workload. Then we will dig into the accelerated side of Dask and how you can leverage GPUs with RAPIDS and Dask-CUDA and use UCX to take advantage of accelerated networking like InfiniBand and NVLink.
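The task-based idea can be sketched with a stdlib-only stand-in (a toy mimicking the spirit of `dask.delayed`, not Dask itself): calls build a lazy task graph, and `compute()` runs independent child tasks in parallel threads.

```python
from concurrent.futures import ThreadPoolExecutor

class Delayed:
    """A lazily evaluated task: a function plus (possibly lazy) arguments."""
    def __init__(self, fn, args):
        self.fn, self.args = fn, args

    def compute(self):
        # Resolve child tasks concurrently, then apply this task's function.
        with ThreadPoolExecutor() as pool:
            resolved = list(pool.map(
                lambda a: a.compute() if isinstance(a, Delayed) else a,
                self.args))
        return self.fn(*resolved)

def delayed(fn):
    """Decorator: calling the wrapped function builds a graph node
    instead of running the function immediately."""
    def wrapper(*args):
        return Delayed(fn, args)
    return wrapper

@delayed
def square(x):
    return x * x

@delayed
def add(a, b):
    return a + b

total = add(square(3), square(4))  # builds the graph; nothing runs yet
print(total.compute())             # → 25
```

Dask's real `delayed` additionally handles keyword arguments, shared intermediate results, and scheduling across processes or whole clusters; the lazy-graph-then-compute shape is the common thread.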

High Performance Computing
Room 6
13:20
100min
Dispatching, Backend Selection, and Compatibility APIs
Guillaume Lemaitre, Joris Van den Bossche, Tim Head, Erik Welch, Marco Gorelli, Sebastian Berg, Aditi Juneja, Stéfan van der Walt

Scientific Python libraries struggle with the existence of several array and dataframe providers. Many important libraries currently support mainly NumPy arrays or pandas dataframes.
However, as library authors we wish to allow users to smoothly use other array providers and to simplify, for example, the use of GPUs without requiring explicit use of CUDA-enabled libraries.

This session will be split into three related discussions around efforts to tackle this situation:
* Dispatching and backend selection discussion
* Array API adoption progress and discussion
* Dataframe compatibility layer discussion

High Performance Computing
Room 5
13:20
30min
wgpu and pygfx: next-generation graphics for Python
Almar Klein

This talk introduces a new render engine for Python, called pygfx (pronounced "py-graphics"). Its purpose is to bring powerful and reliable visualization to the Python world. Since pygfx is built on wgpu, it has superior performance and reliability compared to OpenGL-based solutions. It is also designed to be versatile: with its modular architecture, one can assemble graphical scenes for diverse applications, ranging from scientific visualization to video games.

Data Science and Visualisation
Room 7
13:55
13:55
30min
Regularizing Python using Structured Control Flow
Valentin Haenel

In this talk we will present applied research and working code to regularize
Python programs using a Structured Control Flow Graph (SCFG). This is a novel
approach to rewriting programs at the source level such that the resulting
(regularized) program is potentially more amenable to compiler optimizations,
for example when using Numba[1] to compile Python. The SCFG representation of
a program is simpler to analyze and thus significantly easier to optimize
because the higher order semantic information regarding the program structure
is explicitly included. This can be of great benefit to many scientific
applications such as High Performance Computing (HPC), a discipline that relies
heavily on compiler optimizations to turn user source code into highly
performant executables. Additionally, the SCFG format is a first step to
representing Python programs as Regionalized Value State Dependence Graphs
(RVSDGs). This is another recently proposed program representation which is
expected to unlock even more advanced compiler optimizations at the
Intermediate Representation (IR) level. The talk will cover an introduction to
the theory of SCFGs and RVSDG and demonstrate how programs are transformed. We
will start with simple Python programs containing control-flow constructs and
then show both the SCFG representation and the resulting regularized result to
illustrate the transformations.

High Performance Computing
Room 6
13:55
30min
fastplotlib: A high-level library for ultra fast visualization of large datasets using modern graphics APIs
Kushal Kolar, Caitlin Lewis

Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here, we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general-purpose scientific plotting library that is useful for the fast and live visualization and analysis of complex datasets.

Data Science and Visualisation
Room 7
14:30
14:30
20min
Building optimized packages for conda-forge and PyPI
Wolf Vollprecht, Bas Zalmstra

In this talk we're introducing a new tool to build conda packages. It has been adopted by the conda community and is being rolled out in the widely used conda-forge distribution. The new recipe format has been vetted in multiple Conda Enhancement Proposals (CEPs). We are going to introduce the exciting new features of rattler-build (reproducible builds, high speed build execution, etc.). Using some examples, we will then discuss how you can use rattler-build & conda-forge to build highly optimized packages with SIMD and CUDA support. We will also take a look at cibuildwheel and recent improvements in the PyPI space for CUDA.

High Performance Computing
Room 6
14:30
30min
napari: multi-dimensional image visualization, annotation, and analysis in Python
Grzegorz Bokota, Wouter-Michiel Vierdag

Napari is an interactive n-dimensional image viewer for Python. It is able to rapidly render and interactively visualize almost any array-like image data. Additionally, napari can overlay derived data, such as segmentations, points, polygons, surfaces, and more. Each of these datasets exists as a layer in the napari viewer, which allows fine control over how the data is displayed. Furthermore, derived data can be edited. Together with the capability of writing plugins, napari lets you seamlessly weave exploration, computation, and annotation into common and custom image analysis workflows.

Data Science and Visualisation
Room 7
15:00
15:00
30min
Coffee Break
Room 7
15:00
30min
Coffee Break
Room 6
15:30
15:30
30min
Free-threaded (aka nogil) CPython in the Scientific Python ecosystem: status and road ahead
Loïc Estève

CPython 3.13 will be released in October 2024 and has been in beta since May 2024. One of its most awaited features is the ability to remove the GIL (Global Interpreter Lock) via a compile-time flag.

In this talk we will explain the relevance of free-threaded CPython for the Scientific Python ecosystem, what already works, some of the caveats, and how to try it out on your favourite use case.

In particular we will discuss:
- the historic effort in the scikit-learn project to add Continuous Integration for the nogil fork of CPython 3.9, and the kind of issues that were surfaced
- the ongoing effort in the Scientific Python ecosystem (NumPy, SciPy, scikit-learn, etc.) to test free-threaded CPython 3.13 and fix issues along the way
- how a typical scikit-learn grid-search use case can benefit from free-threaded CPython
- how to try out free-threaded CPython on your favourite use case
- possible future developments
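As an illustrative sketch of the workload shape that stands to gain (the function and numbers are our own, not from the talk): CPU-bound tasks fanned out over a thread pool. On a GIL build the threads effectively serialize; on a free-threaded build they can run on separate cores, with identical results either way.

```python
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    """Deliberately naive, CPU-bound work: count primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

# Fan the same CPU-bound task out over four threads. Only a
# free-threaded (nogil) interpreter can run these truly in parallel;
# a GIL build executes them one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(count_primes, [10_000] * 4))

print(results)  # four identical counts; only the wall time differs by build
```

This is the same shape as a grid search: many independent, CPU-heavy evaluations that today typically require multiprocessing to scale.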

Machine and Deep Learning
Room 7
15:30
30min
Simulated data is all you need: Bayesian parameter inference for scientific simulators with SBI
Jan Boelts (Teusen)

Simulators play a crucial role in scientific research, but accurately determining their parameters to reproduce observed data remains a significant challenge. Classical parameter inference methods often struggle due to the stochastic or black-box nature of these simulators. Simulation-based inference (SBI) offers a solution by enabling Bayesian parameter inference for simulation-based models: It only requires simulated data as input and returns a posterior distribution over suitable model parameters, including uncertainty estimates and parameter interactions. In this talk, we introduce SBI and present sbi, an open source library that serves as a central resource for SBI practitioners and researchers, offering state-of-the-art SBI algorithms, comprehensive documentation and tutorials.
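The core idea can be sketched with a toy rejection-sampling scheme (classic rejection ABC, far simpler than the neural methods the sbi library implements; all names and numbers below are illustrative): draw parameters from a prior, run the simulator, and keep the parameters whose simulated data resemble the observation.

```python
import random
import statistics

def simulator(theta, n=50, rng=random):
    """Black-box simulator: Gaussian observations with unknown mean theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def rejection_sbi(observed, n_draws=5000, eps=0.1, seed=0):
    """Toy posterior sampling: accept prior draws whose simulations
    match the observed summary statistic (here, the sample mean)."""
    rng = random.Random(seed)
    obs_mean = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)         # draw from the prior
        simulated = simulator(theta, rng=rng)  # run the simulator
        if abs(statistics.fmean(simulated) - obs_mean) < eps:
            accepted.append(theta)             # keep: a posterior sample
    return accepted

observed = simulator(2.0, rng=random.Random(42))  # "data" with true mean 2
posterior = rejection_sbi(observed)
print(len(posterior), statistics.fmean(posterior))
```

The accepted draws approximate the posterior over theta, including its spread (uncertainty). Rejection sampling wastes most simulations; the neural density estimators in sbi exist precisely to make this far more simulation-efficient.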

Scientific Applications
Room 6
16:00
16:00
30min
A Comparative Study of Open Source Computer Vision Models for Application on Small Data: The Case of CFRP Tape Laying
Thomas Fraunholz, Tim Köhler

The world of open source computer vision has never been so exciting - and so challenging. With so many options available to you, what's the best way to solve your real world problem? The questions are always the same: Do I have enough data? Which model should I choose? How can I fine-tune and optimize the hyperparameters?

In collaboration with the German Aerospace Center, we investigated these questions to develop a model for quality assurance of CFRP tape laying, with only a small real data set fresh from production. We are very pleased to present a machine learning setup that can empirically answer these questions. Not only for us, but also for you - our setup can easily be transferred to your application!

Dive with us into the world of Open Source machine learning tools that are perfectly tailored for your next project. Discover the seamless integration of Hugging Face Model Hub, DvC and Ray Tune. You'll also gain unique insights into the fascinating world of CFRP tape laying, specifically how well different architectures of open source models perform on our small dataset.

If you want to level up your MLOps game and gain practical knowledge of the latest computer vision models and practices, this talk is a must for you. Don't miss the opportunity, and look forward to your next computer vision projects!

Machine and Deep Learning
Room 7
16:00
30min
Reproducible workflows with AiiDA - The power and challenges of full data provenance
Marnik Bercx, Xing Wang

AiiDA is a workflow manager with a strong focus on reproducibility through automated data provenance. In this talk we discuss what it means to have full “data provenance” for scientific workflows, the advantages it offers, but also the challenges it represents for new users and how we deal with them.

Scientific Applications
Room 6
16:30
16:30
60min
Sprints Orientation + Lightning Talks Day 2
Room 7
17:30
17:30
30min
Closing
Room 7
No sessions on Friday, Aug. 30, 2024.