JupyterLab is very widely used in the Python scientific community. Most, if not all, of the other tutorials will use Jupyter as a tool. Therefore, a solid understanding of the basics is very helpful for the rest of the conference as well as for your later daily work.
This tutorial provides an overview of important basic Jupyter features.
Through the use of NetworkX's API, tutorial participants will learn about the basics of graph theory and its use in applied network science. Starting with a computationally-oriented definition of a graph and its associated methods, we will build out into progressively more advanced concepts (path and structure finding). We will also discuss new advances that speed up NetworkX code by dispatching to alternate computation backends like GraphBLAS. This will be a hands-on tutorial, so stretch your muscles and get ready to go through the exercises!
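As a small, illustrative taste of the API the tutorial builds on (not the tutorial's actual exercises):

```python
import networkx as nx

# Build a small undirected graph and query basic structural properties
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")])

print(G.number_of_nodes(), G.number_of_edges())  # 5 nodes, 5 edges
print(nx.shortest_path(G, "a", "e"))             # path finding, e.g. ['a', 'd', 'e']
print(nx.degree_centrality(G))                   # a simple per-node structure measure
```

Recent NetworkX releases can route calls like these to faster backends when one is installed; the tutorial covers how that dispatching works.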
This tutorial offers a thorough introduction to the srai library for Geospatial Artificial Intelligence. Participants will learn how to use this library for geospatial tasks like downloading and processing OpenStreetMap data, extracting features from GTFS data, dividing an area into smaller regions, and representing regions in a vector space using various spatial features. Additionally, participants will learn to pre-train embedding models and train predictive models for downstream tasks.
This tutorial will provide an introduction to Python intended for beginners.
It will notably introduce the following aspects:
- built-in types
- control flow (e.g. conditions, loops)
- built-in functions
- basic Python classes
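For instance, a few lines touching each of these aspects (illustrative only):

```python
# built-in types
numbers = [1, 2, 3, 4]            # list
name = "EuroSciPy"                # str

# control flow: loop + condition
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n
print(total)                      # 6

# built-in functions
print(len(numbers), sum(numbers), max(numbers))

# a basic Python class
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
print(p.x, p.y)
```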
pandas is a batteries-included dataframe library, implementing hundreds of generic operations for tabular data, such as math and string operations, aggregations, and window functions. In some cases, domain-specific code may benefit from user-defined functions (UDFs) that implement some particular logic. These functions can sometimes be expressed with more basic pandas vectorized operations, in which case they will be reasonably fast; in other cases a Python function working on the individual values needs to be implemented, and those run orders of magnitude slower than their equivalent vectorized versions. In this tutorial we will see how to implement functions in Rust that operate on dataframe values at the individual level but run at the speed of vectorized code, and in some cases faster.
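To make the gap concrete, here is a toy comparison between a value-by-value Python UDF and its vectorized pandas equivalent (column names and logic are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(0).uniform(1, 100, 1_000_000)})

# Python-level UDF applied value by value: flexible, but slow
def discount(price):
    return price * 0.9 if price > 50 else price

slow = df["price"].map(discount)

# Equivalent vectorized expression: typically orders of magnitude faster
fast = df["price"].where(df["price"] <= 50, df["price"] * 0.9)

assert np.allclose(slow, fast)
```

The Rust-backed functions discussed in the tutorial target the cases where no such vectorized rewrite exists.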
NumPy is one of the foundational packages for doing data science with Python. It enables numerical computing by providing powerful N-dimensional arrays and a suite of numerical computing tools. In this tutorial, you'll be introduced to NumPy arrays and learn how to create and manipulate them. Then, you'll see some of the tools that NumPy provides, including random number generators and linear algebra routines.
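For instance, a few of the operations the tutorial introduces (a minimal sketch, not the tutorial material itself):

```python
import numpy as np

# Create and manipulate an N-dimensional array
a = np.arange(12).reshape(3, 4)
print(a.shape, a.dtype)          # (3, 4) and an integer dtype
print(a[:, 1], a.sum(axis=0))    # slicing and axis-wise reductions

# Random number generation and a linear algebra routine
rng = np.random.default_rng(seed=42)
m = rng.normal(size=(3, 3))
print(np.linalg.eigvals(m))
```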
Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.
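A tiny sketch of the filter / aggregate / reshape workflow the session covers (data and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Basel", "Basel", "Zurich", "Zurich"],
    "year": [2022, 2023, 2022, 2023],
    "value": [10.0, 12.5, 8.0, 9.5],
})

recent = df[df["year"] == 2023]                                # filter rows
means = df.groupby("city")["value"].mean()                     # aggregate
wide = df.pivot(index="city", columns="year", values="value")  # reshape
print(recent, means, wide, sep="\n\n")
```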
This tutorial will introduce how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing bias from censored training (and evaluation) data.
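The abstract does not name the libraries used; as one common illustration of handling right-censored data, a Kaplan-Meier estimate with the lifelines package looks like this (toy data):

```python
from lifelines import KaplanMeierFitter

# Durations until the event of interest; event_observed=0 marks right-censored
# samples for which the event had not yet happened when observation stopped.
durations = [5, 6, 6, 2, 4, 4, 7, 10, 12, 3]
event_observed = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
print(kmf.survival_function_)        # estimate that accounts for censoring
print(kmf.median_survival_time_)
```

Naively dropping censored rows, or treating them as events, is exactly the bias the tutorial aims to avoid.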
Ibis provides a common dataframe-like interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pain rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend. In this tutorial users will get experience writing queries using Ibis on a number of local and remote database engines.
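A minimal sketch of such a query, assuming a recent Ibis version where an expression over an in-memory table executes on the default DuckDB backend (table and column names are invented):

```python
import ibis

t = ibis.memtable({"user": ["a", "a", "b"], "amount": [3.0, 1.0, 5.0]})

expr = (
    t.group_by("user")
     .aggregate(total=t.amount.sum())
     .order_by("user")
)
print(expr.execute())  # the same expression could run on another supported backend
```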
This tutorial explains the fundamental ideas and concepts of matplotlib. It's suited for complete beginners to get started as well as existing users who want to improve their plotting abilities and learn about best practices.
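For instance, the object-oriented Figure/Axes interface usually recommended as best practice (a minimal sketch, not the tutorial notebook):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
plt.show()
```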
Update: here is a prepared Jupyter notebook for you to fill with code during the tutorial: https://github.com/StefanieSenger/Talks/blob/main/2023_EuroSciPy/2023_EuroSciPy_Intro_to_scikit-learn_fillout-notebook.ipynb. Please download it and have it at hand when the tutorial starts. You can still download it during the introduction part of the tutorial.
This tutorial will provide a beginner's introduction to scikit-learn, a Python package for machine learning. We will talk about what machine learning is and how scikit-learn can implement it. In the practical part we will learn how to create a predictive modelling pipeline and how to fine-tune its hyperparameters to improve the model's score.
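As a flavour of what such a pipeline can look like (an illustrative sketch, not the tutorial notebook):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predictive modelling pipeline: preprocessing + estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fine-tune a hyperparameter with cross-validated grid search
search = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```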
Privacy guarantees are the most crucial requirement when it comes to analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can be exploited to leak sensitive data when they are attacked and no countermeasure is applied. Privacy-preserving machine learning (PPML) methods hold the promise to overcome all these issues, allowing machine learning models to be trained with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis and see how these techniques can be used to safely train ML models without actually seeing the data.
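The specific methods covered are not listed here; as a toy illustration of one building block in this space, the Laplace mechanism of differential privacy adds calibrated noise to an aggregate before releasing it:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)          # stand-in for sensitive data

def dp_mean(values, lower, upper, epsilon):
    """Release a differentially private mean via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)  # sensitivity of the clipped mean
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

print(ages.mean(), dp_mean(ages, 18, 90, epsilon=1.0))
```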
Do you test your data pipeline? Do you use Hypothesis? In this workshop, we will use Hypothesis, a property-based testing framework, to generate pandas DataFrames for your tests, without involving any real data.
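A small sketch of the idea, using the hypothesis.extra.pandas strategies to generate DataFrames for a toy pipeline step (function and column names are invented):

```python
import hypothesis.strategies as st
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames


def normalize(df):
    """Toy pipeline step under test: scale the 'value' column to [0, 1]."""
    span = df["value"].max() - df["value"].min()
    if span == 0:
        return df.assign(value=0.0)
    return df.assign(value=(df["value"] - df["value"].min()) / span)


@given(data_frames([column("value", elements=st.floats(0, 1e6))]))
def test_normalize_stays_in_unit_interval(df):
    result = normalize(df)
    assert result["value"].between(0, 1).all()
```

Hypothesis generates many DataFrames, including empty and single-row edge cases, without any real data being involved.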
In this hands-on tutorial, participants will delve into numerical optimization fundamentals and engage with the optimization libraries scipy.optimize and estimagic. estimagic provides a unified interface to many popular libraries such as nlopt or pygmo and provides additional diagnostic tools and convenience features. Throughout the tutorial, participants will get the opportunity to solve problems, enabling the immediate application of acquired knowledge. Topics covered include core optimization concepts, running an optimization with scipy.optimize and estimagic, diagnostic tools, algorithm selection, and advanced features of estimagic, such as bounds, constraints, and global optimization.
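For instance, solving a small unconstrained problem with scipy.optimize (an illustrative sketch, not the tutorial's exercises):

```python
import numpy as np
from scipy.optimize import minimize

def sphere(x):
    """A simple smooth test function with its minimum at the origin."""
    return np.sum(x ** 2)

x0 = np.array([2.0, -1.5, 0.5])
result = minimize(sphere, x0, method="L-BFGS-B")
print(result.x, result.fun, result.nfev)
```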
Learn how to show your work with the MERCURY framework. This open-source tool works directly with your computational notebook (e.g., written in Jupyter Notebook). Without knowledge of frontend technologies, you can present your results as a web app (with interactive widgets), dashboard, or report. Learn how to improve your notebook and make your work understandable for non-technical colleagues. Python only!
This tutorial explores scikit-image, the numpy-native library in the scientific Python ecosystem for visual data analysis and manipulation.
Designed for beginners and advanced users alike, it builds image analysis skills and offers insights into the scikit-image documentation.
It covers basic concepts like image histogram, contrast, filtering, segmentation, and descriptors through practical exercises.
The tutorial concludes with advanced performance optimization techniques.
Familiarity with numpy arrays is essential as they are the underlying data representation.
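A short example in the spirit of the exercises (illustrative only):

```python
from skimage import data, filters
from skimage.measure import label, regionprops

# Load a sample image, smooth it, threshold it, and measure the segmented regions
image = data.coins()
smoothed = filters.gaussian(image, sigma=1)
threshold = filters.threshold_otsu(smoothed)
segments = label(smoothed > threshold)

print(f"{segments.max()} connected regions")
print([round(region.area) for region in regionprops(segments)][:5])
```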
In the rapidly evolving landscape of Machine Learning (ML), significant advancements like Large Language Models (LLMs) are gaining critical importance in both industrial and academic spheres. However, the rush towards deploying advanced models harbors inherent ethical tensions and potential adverse societal impacts. The keynote will start with a brief introduction to the principles of ethics, viewed through the lens of philosophy, emphasizing how these fundamental concepts find application within ML. Grounding our discussion in tangible realities, we will delve into pertinent case studies, including the BigScience open science initiative, elucidating the practical application of ethical considerations. Additionally, the keynote will touch upon findings from my recent research, which investigates the synergy between ethical charters, legal tools, and technical documentation in the context of ML development and deployment.
From sensor data to epidemic outbreaks, particle dynamics to environmental monitoring, much of the crucial real-world data has a temporal nature. Fundamental challenges facing data specialists dealing with time series include not only predicting future values, but also determining when these values are alarming. Standard anomaly detection algorithms and common rule-based heuristics often fall short in addressing this problem effectively. In this talk, we will closely examine this domain, exploring its unique characteristics and challenges. You will learn to apply some of the most promising techniques for detecting time series anomalies, as well as relevant scientific Python tools that can help you with it.
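The talk's concrete techniques are not listed in this abstract; as a simple baseline for comparison, a rolling z-score flags points that deviate strongly from their recent history:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ts = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500))
ts.iloc[250] += 5.0                      # inject an anomaly

rolling = ts.rolling(window=50)
zscore = (ts - rolling.mean()) / rolling.std()
anomalies = ts[zscore.abs() > 4]
print(anomalies)                         # should report the injected point
```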
Let's Talk Inclusivity and Mental Health.
What's beyond the lines of code? Let's explore the spectrum of experiences, from contributors to volunteers, developers to conference attendees.
Join us to share your insights, experiences, and solutions for a more supportive and inclusive scientific Python ecosystem. Let's empower one another and shape a community that thrives on empathy, understanding, and collaboration.
We love to use Python in our day jobs, but that enterprise database you run your ETL job against may have other ideas. It probably speaks SQL, because SQL is ubiquitous, it’s been around for a while, it’s standardized, and it’s concise.
But is it really standardized? And is it always concise? No!
Do we still need to use it? Probably!
What’s a data-person to do? String-templated SQL?
print(f”That way lies {{ m̴͕̰̻̏́ͅa̸̟̜͉͑d̵̨̫̑n̵̖̲̒͑̾e̸̘̼̭͌s̵͇̖̜̽s̸̢̲̖͗͌̏̊͜ }}”.)
Instead, come and learn about Ibis! It offers a dataframe-like interface to construct concise and composable queries and then executes them against a wide variety of backends (Postgres, DuckDB, Spark, Snowflake, BigQuery, you name it).
When operating a classifier in a production setting (i.e. the predictive phase), practitioners are interested in two potentially different outputs: a "hard" decision used to drive a business decision and/or a "soft" decision to get a confidence score linked to each potential decision (e.g. usually related to class probabilities).
Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier to get a confidence score close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard" predictions using this heuristic. Conversely, training a classifier for optimum "hard" prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a calibrated classifier.
In this talk, we will present a new scikit-learn meta-estimator allowing us to get the best of both worlds: a calibrated classifier providing optimum "hard" predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.
We will provide some insights regarding how to obtain accurate probabilities and predictions, and also illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.
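The meta-estimator itself is not released yet (see the PR above); to illustrate the underlying idea, the decision threshold can be scanned manually on a validation set instead of hard-coding 0.5 (a simplified sketch, not the upcoming scikit-learn API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# The default "hard" prediction uses a fixed 0.5 cut-off; scan alternatives
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"F1 at 0.5: {f1_score(y_val, proba >= 0.5):.3f}, "
      f"F1 at tuned {best:.2f}: {max(scores):.3f}")
```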
pandas reached its 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
Have you ever wanted to write a DataFrame-agnostic function, which should perform the same operation regardless of whether the input is pandas / polars / something else? Did you get stuck with special-casing to handle all the different APIs? Fear not, the DataFrame Standard is here to help!
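To make the pain point concrete, this is the kind of per-library branching the Standard aims to make unnecessary (an illustration of the problem, not of the Standard's API):

```python
import pandas as pd
import polars as pl

def add_total_column(df):
    """Same logical operation, written once per supported library."""
    if isinstance(df, pd.DataFrame):
        return df.assign(total=df["price"] * df["quantity"])
    if isinstance(df, pl.DataFrame):
        return df.with_columns((pl.col("price") * pl.col("quantity")).alias("total"))
    raise TypeError(f"Unsupported dataframe type: {type(df)}")

data = {"price": [2.0, 3.5], "quantity": [3, 2]}
print(add_total_column(pd.DataFrame(data)))
print(add_total_column(pl.DataFrame(data)))
```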
Since its release, ChatGPT has been widely adopted as "the" text generation tool used across all industries and businesses. This also includes the domain of scientific research, where we observe more and more scientific papers partially or even fully generated by AI. The same applies to the peer-review reports created while reviewing a paper.
What are the guidelines in the scientific research world? What is now the meaning of the written word and how do we build a model that can identify whether a text is AI-generated? What are the potential solutions to solve this important issue?
In this talk, we discuss how to detect AI-generated text and how to create a scalable architecture integrating this tool.
This maintainer track aims to lead discussions about the current needs for sparse data in the scientific Python ecosystem. It will present achievements and the continuation of the work initiated at the first Scientific Python Developer Summit, which took place from 22 to 28 May 2023.
Scientific code is often complex, resource-intensive, and sensitive to performance issues, making accurate timing and benchmarking critical for optimising performance and ensuring reproducibility. However, benchmarking scientific code presents several challenges, including variability in input data, hardware and software dependencies, and optimisation trade-offs. In this talk, I discuss the importance of timing and benchmarking for scientific code and outline strategies for addressing these challenges. Specifically, I emphasise the need for representative input data, controlled benchmarking environments, appropriate metrics, and careful documentation of the benchmarking process. By following these strategies, developers can effectively optimise code performance, select efficient algorithms and data structures, and ensure the reliability and reproducibility of scientific computations.
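Even before a full benchmarking setup, the standard library already supports repeatable micro-timings; a minimal sketch (the actual workloads discussed in the talk may differ):

```python
import timeit

setup = "import numpy as np; x = np.random.default_rng(0).random(1_000_000)"

# repeat() returns one total time per run; the minimum is usually the figure
# least affected by background noise on the machine.
python_sum = timeit.repeat("sum(x)", setup=setup, number=10, repeat=5)
numpy_sum = timeit.repeat("x.sum()", setup=setup, number=10, repeat=5)
print(f"builtin sum: {min(python_sum):.3f}s, numpy sum: {min(numpy_sum):.3f}s (10 calls each)")
```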
This talk will discuss the importance of Continuous Integration and Continuous Delivery (CI/CD) principles in the development of scientific applications, with a focus on creating robust and reproducible code that can withstand rigorous testing and scrutiny. The presentation will cover best practices for project structure and code organization, as well as strategies for ensuring reproducibility, collaboration, and managing dependencies. By implementing CI/CD principles in scientific application development processes, researchers can improve efficiency, reliability, and maintainability, ultimately accelerating research.
Python is slow. We feel the performance limitations when doing computationally intensive work. There are many libraries and methods to accelerate your computations, but which way to go? This talk serves as a navigation guide through the world of speeding up Python. At the end, you should have a high-level understanding of performance aspects and know which way to go when you want to speed up your code next time.
Solara is a pure Python web framework designed to scale complex applications. Leveraging a React-like API, Solara offers the scalability, component-based coding, and simple state management that have made React a standard for large web applications. Solara uses a pure Python implementation of React, Reacton, to create ipywidgets-based applications that work both in the Jupyter Notebook environment and as standalone web apps with frameworks like FastAPI. This talk will explore the design principles of Solara, illustrate its potential with case studies and live examples, and provide resources for attendees to incorporate Solara into their own projects. Whether you're a researcher developing interactive visualizations or a data scientist building complex web applications, Solara provides a Python-centric solution for scaling your projects effectively.
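As a flavour of the API, a minimal component with reactive state might look like this (a sketch assuming Solara's `solara.reactive` and `@solara.component` interface; run it with the `solara run` command or inside Jupyter):

```python
import solara

# Application state lives in a reactive value; components re-render when it changes
clicks = solara.reactive(0)

@solara.component
def Page():
    solara.Button(f"Clicked {clicks.value} times",
                  on_click=lambda: clicks.set(clicks.value + 1))
```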
NumPy is planning a 2.0 release early next year, replacing the 1.x releases. While we hope that the release will not be disruptive to most users, we do plan some larger changes that may affect many. These changes include modifications to the Python and C APIs, for example making the NumPy promotion rules more consistent around scalar values.
Chalk'it is an open-source framework that transforms Python scripts into distributable web app dashboards. It utilizes drag-and-drop widgets to establish an interface linked to a dataflow connecting Python code and various data sources. Chalk'it supports multiple Python graphics libraries, including Plotly, Matplotlib and Folium for interactive mapping and visualization. The framework operates entirely in web browsers using Pyodide. In our presentation, we will showcase Chalk'it, emphasizing its primary features, software architecture, and key applications, with a special focus on geospatial data visualization.
estimagic is a Python package for nonlinear optimization with or without constraints. It is particularly suited to solving difficult nonlinear estimation problems. On top of that, it provides functionality to perform statistical inference on estimated parameters.
In this presentation, we give a tour through estimagic's most notable features and explain its position in the ecosystem of Python libraries for numerical optimization.
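A minimal sketch of what running an optimization can look like, assuming estimagic's `minimize(criterion=..., params=..., algorithm=...)` interface (treat the exact signature as an assumption and check the current documentation):

```python
import estimagic as em
import numpy as np

def sphere(params):
    return params @ params            # scalar criterion to minimize

res = em.minimize(
    criterion=sphere,
    params=np.arange(5, dtype=float),
    algorithm="scipy_lbfgsb",
)
print(res.params)
```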
Polars is the "relatively" new fast dataframe implementation that redefines what DataFrames are able to do on a single machine, both in regard to performance and dataset size.
In this talk, we will dive into polars and see what makes them so efficient. It will touch on technologies like Arrow, Rust, parallelism, data structures, query optimization and more.
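For instance, a lazy query lets the engine optimize and parallelize the whole plan before executing it (a small sketch; method names such as `group_by` assume a recent Polars release):

```python
import polars as pl

df = pl.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

query = (
    df.lazy()
      .filter(pl.col("value") > 1.0)
      .group_by("group")
      .agg(pl.col("value").mean().alias("mean_value"))
)
print(query.collect())
```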
So you don't know JavaScript but know how to use Python? Do you want to build an app where you can draw molecules for some application like property prediction? Then come to this talk where I'll show you how to use Ketcher (EPAM's tool for small-molecule drawing), PyScript and RDKit for your next drug discovery app.
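On the chemistry side, the pieces fit together roughly like this: the drawn structure can be exported from Ketcher as SMILES, and RDKit turns it into properties (a minimal sketch of the RDKit part only):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin, as it might come from the sketcher
mol = Chem.MolFromSmiles(smiles)

print(Descriptors.MolWt(mol))           # molecular weight
print(Descriptors.MolLogP(mol))         # crude lipophilicity estimate
```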
Zarr is an API and cloud-optimized data storage format for large, N-dimensional, typed arrays, based on an open-source technical specification. In the last 4 years it grew from a Python implementation to a large ecosystem. In this talk, we want to share how this transformation happened and our lessons learned from this journey. Today, Zarr is driven by an active community, defined by an extensible specification, has implementations in C++, C, Java, Javascript, Julia, and Python, and is used across domains such as Geospatial, Bio-imaging, Genomics and other Data Science domains.
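A small example of the core API (illustrative; it creates a chunked array on local disk):

```python
import numpy as np
import zarr

# Create a chunked, N-dimensional on-disk array and read back a slice
z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4")
z[0:1_000, 0:1_000] = np.random.default_rng(0).random((1_000, 1_000))
print(z.info)
print(z[0, :5])
```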
This slot will cover the effort regarding interoperability in the scientific Python ecosystem. Topics:
- Using the Array API for array-producing and array-consuming libraries
- DataFrame interchange and namespace APIs
- Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
- Entry Points: Enabling backends and plugins for your libraries
Using the Array API for array-producing and array-consuming libraries
Already using the Array API or wondering if you should in a project you maintain? Join this maintainer track session to share your experience and exchange knowledge and tips around building array libraries that implement the standard or libraries that consume arrays.
DataFrame-agnostic code using the DataFrame API standard
The DataFrame Standard provides you with a minimal, strict, and predictable API, to write code that will work regardless of whether the caller uses pandas, polars, or some other library.
DataFrame Interchange protocol and Apache Arrow
The DataFrame interchange protocol and Arrow C Data interface are two ways to interchange data between dataframe libraries. What are the challenges and requirements that maintainers encounter when integrating this into consuming libraries?
Entry Points: Enabling backends and plugins for your libraries
In this talk, we will discuss how NetworkX used entry points to enable more efficient computation backends to plug into NetworkX.
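The mechanics are ordinary Python packaging machinery: plugins register themselves under an agreed entry-point group, and the host library discovers them at runtime. A generic sketch (the group name below is hypothetical, and `entry_points(group=...)` needs Python 3.10+ or the importlib_metadata backport):

```python
from importlib.metadata import entry_points

# "myproject.backends" is a hypothetical group name used for illustration
available = entry_points(group="myproject.backends")
backends = {ep.name: ep.load() for ep in available}
print(sorted(backends))   # the host library can now dispatch to any of these
```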
Today, state-of-the-art scientific research as well as industrial software development strongly depend on open source libraries. The demographic of the contributors to these libraries is predominantly white and male. In order to increase participation of groups who have been historically underrepresented in this domain, PyLadies Berlin, a volunteer-run community group focused on helping marginalised people professionally establish themselves in tech, has been running monthly hands-on open source hack nights for more than a year. After some initial challenges, the initiative yielded encouraging results. This talk summarises the learnings and shows how they can be applied in the wider open source community.
In this project we develop tools to identify birds within the spatial extent of a meteorological radar. Using the opportunities created by modern dual-polarization radars, we build graph neural networks to identify bird flocks. For this, the original point cloud data is converted to multiple undirected graphs following a set of predefined rules, which are then used as input to a graph convolutional neural network (Kipf and Welling, 2017, https://doi.org/10.48550/arXiv.1609.02907). Each node has a set of features such as range, x, y, z coordinates and several radar-specific parameters, e.g. differential reflectivity and phase shift, which are used to build the model and conduct graph-level classification. This tool will alleviate the problem of manual identification and labelling, which is tedious and time-intensive. Going forward we also focus on using the temporal information in the radar data. Repeated radar measurements enable us to track these movements across space and time. This makes it possible for regional movement studies to bridge the methodological gap between fine-scale, individual-based tracking studies and continental-scale monitoring of bird migration. In particular, it enables novel studies of the roles of habitat, topography and environmental stressors on movements that are not feasible with current methodology. Ultimately, we want to apply the methodology to data from continental radar networks to study movement across scales.
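For orientation, graph-level classification with the Kipf & Welling convolution can be sketched with PyTorch Geometric along these lines (an illustrative architecture, not the project's actual model):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(torch.nn.Module):
    def __init__(self, num_node_features, hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))     # message passing over the graph
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)            # pool node features per graph
        return self.head(x)                       # graph-level class logits
```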
Have you ever wondered what type of data you can get about a certain location on the globe? What if I told you that you can access an enormous amount of information while sitting right there at your laptop? In this talk, I'll show you how to use Google Earth Engine to enrich your dataset. Whether you're exploring or planning your next ML project, geospatial data can provide you with a lot of information you did not know you had access to. Let me show you how!
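For example, building a median Sentinel-2 composite around a point and summarising a vegetation index takes only a few lines with the Earth Engine Python API (the dataset ID and coordinates are just examples; an Earth Engine account is required):

```python
import ee

ee.Authenticate()                         # interactive, once per machine
ee.Initialize()

point = ee.Geometry.Point(7.59, 47.56)    # lon/lat, roughly Basel

composite = (
    ee.ImageCollection("COPERNICUS/S2_SR")
      .filterBounds(point)
      .filterDate("2023-06-01", "2023-09-01")
      .median()
)
ndvi = composite.normalizedDifference(["B8", "B4"]).rename("NDVI")
print(ndvi.reduceRegion(ee.Reducer.mean(), point.buffer(500), scale=10).getInfo())
```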
The gallery of your project might group the examples by module, by use case, or some other logic. But as examples grow in complexity, they may be relevant for several groups. In this talk we discuss some possible solutions and their drawbacks to motivate the introduction of a new feature to sphinx-gallery: a content-based recommendation system.
By using Dask to scale out RAPIDS workloads on Kubernetes you can accelerate your workloads across many GPUs on many machines. In this talk, we will discuss how to install and configure Dask on your Kubernetes cluster and use it to run accelerated GPU workloads on your cluster.
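A rough sketch of the workflow, assuming the dask-kubernetes operator is installed on the cluster (the cluster name, image tag and sizes below are illustrative, not a recommendation):

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster
import dask.array as da

# Spin up Dask workers as pods on the Kubernetes cluster (illustrative values)
cluster = KubeCluster(
    name="rapids-demo",
    image="rapidsai/rapidsai:23.08-cuda11.8-runtime-ubuntu22.04-py3.10",
    n_workers=4,
)
client = Client(cluster)

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
print(x.mean().compute())      # work is scheduled across the workers
```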
AI is poised to be "Our final invention," either the key to a never-ending utopia or a direct road to dystopia (or apocalypse). Even without the eschatological framing, it's still a revolutionary technology increasingly embedded in every aspect of our life, from smartphones to smart cities, from autonomous agents to autonomous weapons. In the face of acceleration, there can be no delay: if we want AI to shape a better tomorrow, we must discuss safety today.
Pyodide is a Python distribution for the browser and Node.js based on WebAssembly / Emscripten.
Pyodide supports most commonly used scientific Python packages, like numpy, scipy, scikit-learn and matplotlib, and there is growing interest in using it to improve package documentation through interactivity.
In this talk we will describe the work we have done in the past nine months to improve the state of Pyodide in a scientific Python context, namely:
- running the scikit-learn and scipy test suites with Node.js to get a view of what currently works, what does not, and what can hopefully be fixed one day
- packaging OpenBLAS in Pyodide and using it for the Pyodide scipy package to improve its stability, maintainability and performance
- adding JupyterLite functionality to sphinx-gallery, which is used for the example galleries of popular scientific Python packages like scikit-learn, matplotlib, scikit-image, etc.
- enabling the sphinx-gallery JupyterLite functionality for the scikit-learn example gallery
We will also mention some of the Pyodide sharp bits and conclude with some of the ideas we have to use it even more widely.
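For readers who have not tried it: inside a Pyodide-based environment (the Pyodide console, JupyterLite, ...), packages are installed with micropip rather than pip, and top-level await is available:

```python
# Run inside a Pyodide runtime, e.g. the Pyodide console or a JupyterLite notebook
import micropip
await micropip.install("scikit-learn")

from sklearn.linear_model import LinearRegression
print(LinearRegression())
```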
Pickle files can be evil: simply loading them can run arbitrary code on your system. This talk presents why that is, how it can be exploited, and how skops is tackling the issue for scikit-learn/statistical ML models. We go through some lower-level pickle-related machinery, and go into detail on how the new format works.
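The core of the problem fits in a few lines: unpickling calls whatever `__reduce__` returned, so loading an untrusted file is enough to execute attacker-controlled code (harmless demonstration below):

```python
import os
import pickle

class Exploit:
    def __reduce__(self):
        # Whatever is returned here is *called* at load time
        return (os.system, ("echo arbitrary code executed at load time",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)   # merely loading the bytes runs the command
```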
Imagine a world where there are tools allowing any researcher to easily produce high-quality scientific websites. Where it's trivial to include rich interactive figures that connect to Jupyter servers or run in-browser with WASM and pyodide, all from a local folder of markdown files and Jupyter notebooks.
We introduce MyST Markdown (https://mystmd.org/), a set of open-source, community-driven tools designed for open scientific communication.
It's a powerful authoring framework that supports blogs, online books, scientific papers, preprints, reports and journal articles. It includes thebe, a minimal connector library for Jupyter, and thebe-lite, which bundles a JupyterLite server with pyodide into any web page for in-browser Python. It also provides publication-ready TeX and PDF generation from the same content base, minimising the rework of publishing to the web and to traditional services.
Python versioning is a critical aspect of maintaining a consistent ecosystem of packages, yet it can be challenging to get right. In this talk, we will explore the difficulties of Python versioning, including the need for upper bounds, and discuss mitigation strategies such as lockfiles in the Python packaging ecosystem (pip, poetry, and conda / mamba). We will also highlight a new community effort to analyze Python libraries dynamically and statically to detect the symbols (or libraries) they are using. By analyzing symbol usage, we can predict when package combinations will start breaking with each other, achieving a high rate of correct predictions. Our goal is to gather more community inputs to create a robust compatibility matrix. Additionally, we are doing similar work in C/C++ using libabigail to address ABI problems.
Rigid transformations in 3D are complicated due to the multitude of different conventions and because they often form complex graphs that are difficult to manage. In this talk I will give a brief introduction to the topic and present the library pytransform3d as a set of tools that can help you tame the complexity. Throughout the talk I will use examples from robotics (imitation learning, collision detection, state estimation, kinematics) to motivate the discussed features, even though the presented solutions are useful beyond robotics.
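As a taste of the library, transforms between named frames can be chained and inverted automatically with the TransformManager (a minimal sketch; frame names and poses are invented):

```python
import numpy as np
from pytransform3d.transformations import transform_from
from pytransform3d.transform_manager import TransformManager

# Homogeneous 4x4 transforms between named frames (identity rotations for brevity)
cam2robot = transform_from(R=np.eye(3), p=np.array([0.3, 0.0, 0.5]))
object2cam = transform_from(R=np.eye(3), p=np.array([0.0, 0.1, 1.0]))

tm = TransformManager()
tm.add_transform("camera", "robot", cam2robot)
tm.add_transform("object", "camera", object2cam)

# The manager walks the frame graph, chaining and inverting as needed
print(tm.get_transform("object", "robot"))
```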
Could scikit-learn's future be GPU-powered? This talk will discuss the performance improvements that GPU computing could bring to existing scikit-learn algorithms, and will describe a plugin-based design that is being envisioned to open up scikit-learn to faster compute backends, with special concern for user-friendliness, ease of installation, and interoperability.
In this talk, we will discuss incident management using Hawkes processes within an IT infrastructure. We show how a model previously applied for earthquake predictions can help answer the question ‘what caused what’ in a major European bank.
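For reference, a linear Hawkes process with the commonly used exponential kernel is defined by the conditional intensity (standard notation, not necessarily the exact parametrisation used in the talk):

```latex
\lambda(t) = \mu + \sum_{t_i < t} \alpha \, e^{-\beta (t - t_i)}
```

where \mu is the baseline event rate, each past event at time t_i temporarily raises the intensity by \alpha, and \beta controls how quickly that excitation decays; this self-exciting structure is what lets the model attribute which incidents triggered which.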
The graphic server protocol is a proposal to mutualize efforts across scientific visualization libraries, languages and platforms such as to provide a unified intermediate-level protocol to render graphical primitives independently of the specifics of the high-level visualization interfaces.
This talk discusses using the pandas API on Apache Spark to handle big data, and the introduction of Pandas Function APIs. Presented by an Apache Spark committer and a product manager, it offers technical and managerial insights.
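A flavour of both APIs (a short sketch assuming a working Spark installation; data and schema are invented):

```python
import pyspark.pandas as ps

# pandas-like syntax, executed by Spark
psdf = ps.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
print(psdf.groupby("group")["value"].mean())

# Pandas Function API: apply a pandas function to each batch of a Spark DataFrame
sdf = psdf.to_spark()

def add_one(batches):
    for pdf in batches:                 # each batch arrives as a pandas DataFrame
        yield pdf.assign(value=pdf["value"] + 1)

sdf.mapInPandas(add_one, schema="group string, value double").show()
```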
Handling and analyzing massive data sets is highly important for the vast majority of research communities, but it is also challenging, especially for those communities without a background in high-performance computing (HPC). The Helmholtz Analytics Toolkit (Heat) library offers a solution to this problem by providing memory-distributed and hardware-accelerated array manipulation, data analytics, and machine learning algorithms in Python, targeting the usage by non-experts in HPC.
In this presentation, we will provide an overview of Heat's current features and capabilities and discuss its role in the ecosystem of distributed array computing and machine learning in Python.
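A minimal sketch of the programming model, assuming Heat's NumPy-like API where a `split` axis distributes an array across MPI processes (check the Heat documentation for the exact interface):

```python
import heat as ht

# The split argument distributes the array along axis 0 across the MPI processes;
# launch with e.g.: mpirun -n 4 python this_script.py
x = ht.arange(1_000_000, split=0)
print(x.sum())     # reductions run on the distributed array
print(x.mean())
```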
The use of AI documentation, such as repository cards (model and dataset cards), as a means of transparently discussing ethical and inclusivity problems that may be found in the outputs and/or during the creation of AI artefacts, with the aim of inclusivity, fairness and accountability, has increasingly become part of the ML discourse. Documentation approaches centred on limitations and risks have become more standard and are now anticipated with the launch of new developments, e.g. the ChatGPT/GPT-4 system card and other LLM model cards.
This talk highlights the inclusive approaches that the broader open source community could explore when thinking about their aims when creating documentation.