08:30
08:30
90min
Registration
Kuppelsaal
10:00
10:00
45min
Opening Session
Kuppelsaal
10:50
10:50
90min
Accelerate Python with Julia
Stephan Sahm

Speeding up Python code has traditionally meant writing C/C++, an alien world for most Python users. Today, you can instead write high-performance code in Julia, which is much easier for Python users to pick up. This tutorial will give you hands-on experience writing a Python library that incorporates Julia for performance optimization.

General: Others
A05-A06
10:50
45min
Apache StreamPipes for Pythonistas: IIoT data handling made easy!
Tim Bossenmaier, Sven Oehler

The industrial environment offers many interesting use cases for data enthusiasts, with myriad challenges for data scientists to solve.
However, collecting industrial data in general, and industrial IoT (IIoT) data in particular, is cumbersome and not very appealing for anyone who just wants to work with data.
Apache StreamPipes addresses this pitfall and allows anyone to extract data from IIoT data sources without messing around with (old-fashioned) protocols. In addition, StreamPipes' newly developed Python client now gives Pythonistas the ability to access and work with this data programmatically, in a Pythonic way.

This talk will provide a basic introduction to the functionality of Apache StreamPipes itself, followed by a deeper discussion of the Python client. Finally, a live demo will show how IIoT data can be easily retrieved in Python and used directly for visualization and ML model training.

PyData: Data Handling
B09
10:50
30min
Cooking up a ML Platform: Growing pains and lessons learned
Cole Bailey

What is an ML platform, and do you even need one? When should you consider investing in your own ML platform? What challenges can you expect building and maintaining one? Tune in and discover (some) answers to these questions and more! I will share a first-hand account of our ongoing journey towards becoming an ML platform team within Delivery Hero's Logistics department, including how we got here, how we structure our work, and which challenges and tools we are focusing on next.

Sponsor
B07-B08
10:50
45min
From notebook to pipeline in no time with LineaPy
Thomas Fraunholz

The nightmare before data science production: you found a working prototype for your problem in a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production-grade. The good news is, there's finally a cure!

The open-source Python package LineaPy aims to automate data science workflow generation, expediting the process of going from data science development to production. It truly transforms messy notebooks into data pipelines for orchestration frameworks like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it!

In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?

PyCon: DevOps & MLOps
A1
10:50
45min
Honey, I broke the PyTorch model >.< - Debugging custom PyTorch models in a structured manner
Clara Hoffmann

When building PyTorch models for custom applications from scratch there's usually one problem: The model does not learn anything. In a complex project, it can be tricky to identify the cause: Is it the data? A bug in the model? Choosing the wrong loss function at 3 am after an 8-hour coding session?

In this talk, we will build a toolbox to find the culprits in a structured manner. We will focus on simple ways to ensure a training loop is correct, generate synthetic training data to determine whether we have a model bug or problematic real-world data, and leverage pytest to safely refactor PyTorch models.

After this talk, attendees will be well equipped to take the right steps when a model is not learning, quickly identify the underlying reasons, and prevent bugs in the future.

PyData: Deep Learning
B05-B06
10:50
90min
How to teach NLP to a newbie & get them started on their first project
Lisa Andreevna Chalaguine

The materials presented during this tutorial are open source and can be used by coaches and tutors who want to teach their students how to use Python for text processing and text classification. Students need only a minimal understanding of programming, in any language.

PyData: Natural Language Processing
A03-A04
10:50
45min
Pandas 2.0 and beyond
Joris Van den Bossche, Patrick Hoefler

Pandas has reached a 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.

PyData: PyData & Scientific Libraries Stack
Kuppelsaal
11:40
11:40
45min
An unbiased evaluation of environment management and packaging tools
Anna-Lena Popkes

Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools.

This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features.

PyCon: Programming & Software Engineering
Kuppelsaal
11:40
45min
AutoGluon: AutoML for Tabular, Multimodal and Time Series Data
Caner Turkmen, Oleksandr Shchur

AutoML, or automated machine learning, offers the promise of transforming raw data into accurate predictions with minimal human intervention, expertise, and manual experimentation. In this talk, we will introduce AutoGluon, a cutting-edge toolkit that enables AutoML for tabular, multimodal and time series data. AutoGluon emphasizes usability, enabling a wide variety of tasks from regression to time series forecasting and image classification through a unified and intuitive API. We will specifically focus on tabular and time series tasks, where AutoGluon is the current state of the art, and demonstrate how AutoGluon can be used to achieve competitive performance on tabular and time series competition data sets. We will also discuss the techniques used to automatically build and train these models, peeking under the hood of AutoGluon.

PyData: Machine Learning & Stats
B05-B06
11:40
45min
Hyperparameter optimization for the impatient
Martin Wistuba

In recent years, hyperparameter optimization (HPO) has become a fundamental step in the training of machine learning (ML) models and in the creation of automatic ML pipelines. Unfortunately, while HPO improves the predictive performance of the final model, it comes with a significant cost, both in terms of computational resources and waiting time. This leads many practitioners to try to lower the cost of HPO by employing unreliable heuristics.

In this talk we will provide simple and practical algorithms for users who want to train models with almost-optimal predictive performance while incurring significantly lower cost and waiting time. The presented algorithms are agnostic to the application and the model being trained, so they can be useful in a wide range of scenarios.

We provide results from an extensive experimental activity on public benchmarks, including comparisons with well-known techniques such as Bayesian Optimization (BO), ASHA, and Successive Halving. We will describe in which scenarios the biggest gains are observed (up to 30x) and provide examples of how to use these algorithms in a real-world environment.

All the code used for this talk is available on GitHub: https://github.com/awslabs/syne-tune
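The successive-halving idea referenced above can be sketched in a few lines of plain Python. This is a toy illustration only, not Syne Tune's actual API; the `evaluate` objective below is a made-up stand-in for real model training:

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs at each rung, multiplying the budget by eta."""
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving config with the current budget
        scores = {c: evaluate(c, budget) for c in configs}
        # Keep the top 1/eta fraction (higher score = better)
        keep = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get, reverse=True)[:keep]
        budget *= eta
    return configs[0]

# Toy objective: "accuracy" improves with budget, best near lr = 0.1
def evaluate(lr, budget):
    return 1.0 - abs(lr - 0.1) - 1.0 / (budget + 1)

best = successive_halving([0.001, 0.01, 0.1, 0.5], evaluate)
print(best)  # 0.1
```

The key efficiency gain is that poor configurations are discarded after only a small budget, so most compute is spent on promising candidates.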

PyData: Machine Learning & Stats
B09
11:40
45min
Incorporating GPT-3 into practical NLP workflows
Ines Montani

In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using an annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you'll own and control the model for runtime.

PyData: Natural Language Processing
A1
11:40
45min
Large Scale Feature Engineering and Data Science with Python & Snowflake
Michael Gorkow

Snowflake, as a data platform, is the core data repository of many large organizations.
With the introduction of Snowflake's Snowpark for Python, Python developers can now collaborate and build on one platform within a secure Python sandbox that provides dynamic scalability and elasticity as well as security and compliance.

In this talk I'll explain the core concepts of Snowpark for Python and how they can be used for large scale feature engineering and data science.

Sponsor
B07-B08
12:25
12:25
75min
Lunch
Kuppelsaal
12:25
75min
Lunch
B09
12:25
75min
Lunch
B07-B08
12:25
75min
Lunch
B05-B06
12:25
75min
Lunch
A1
12:25
75min
Lunch
A03-A04
12:25
75min
Lunch
A05-A06
13:40
13:40
15min
Announcements
Kuppelsaal
13:55
13:55
45min
Keynote - A journey through 4 industries with Python: Python's versatile problem-solving toolkit
Susan Shu Chang

In this keynote, I will share the lessons learned from using Python in 4 industries. Apart from the machine learning applications that I build day to day as a data scientist and machine learning engineer, I also use Python to develop games for my own gaming company, Quill Game Studios. There is a lot of versatility in Python, and it has been my pleasure to use it to solve many interesting problems. I hope that this talk can provide inspiration for various types of applications in your own industry as well.

Plenary
Kuppelsaal
14:40
14:40
30min
Coffee Break
Kuppelsaal
14:40
30min
Coffee Break
B09
14:40
30min
Coffee Break
B07-B08
14:40
30min
Coffee Break
B05-B06
14:40
30min
Coffee Break
A1
14:40
30min
Coffee Break
A03-A04
14:40
30min
Coffee Break
A05-A06
15:10
15:10
30min
Common issues with Time Series data and how to solve them
Vadim Nelidov

Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets. It's hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task.
For this talk, together with other domain experts, I have collected a list of common time series issues that data professionals commonly run into. After this talk, you will learn to identify, understand, and resolve such issues. This will include stabilising divergent time series, organising delayed / irregular data, handling missing values without anomaly propagation, and reducing the impact of noise and outliers on your forecasting models.
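One of the issues listed above, missing values, can be illustrated with a minimal pure-Python linear interpolation. This is a toy sketch under simple assumptions (evenly spaced points, `None` marking gaps); real pipelines would use pandas or a similar library, and `interpolate_missing` is a hypothetical helper, not from the talk:

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for i, v in enumerate(filled):
        if v is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None:        # no earlier value: back-fill
                filled[i] = filled[right]
            elif right is None:     # no later value: forward-fill
                filled[i] = filled[left]
            else:                   # interpolate between neighbours
                frac = (i - left) / (right - left)
                filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

print(interpolate_missing([1.0, None, None, 4.0]))  # [1.0, 2.0, 3.0, 4.0]
```

Naive filling like this can propagate anomalies into the gaps (an outlier neighbour skews every interpolated point), which is exactly the kind of pitfall the talk addresses.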

PyData: Data Handling
B05-B06
15:10
30min
Exploring the Power of Cyclic Boosting: A Pure-Python, Explainable, and Efficient ML Method
Felix Wick

We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable on individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library.

You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting

Sponsor
B09
15:10
30min
How to baseline in NLP and where to go from there
Tobias Sterbak

In this talk, we will explore the build-measure-learn paradigm and the role of baselines in natural language processing (NLP). We will cover the common NLP tasks of classification, clustering, search, and named entity recognition, and describe the baseline approaches that can be used for each task. We will also discuss how to move beyond these baselines through weak learning and transfer learning. By the end of this talk, attendees will have a better understanding of how to establish and improve upon baselines in NLP.

PyData: Natural Language Processing
A03-A04
15:10
90min
Practical Session: Learning on Heterogeneous Graphs with PyG
Ramona Bendias, Matthias Fey

Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library in Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.

PyCon: Libraries
A05-A06
15:10
30min
Raised by Pandas, striving for more: An opinionated introduction to Polars
Nico Kreiling

Pandas is the de-facto standard for data manipulation in Python, which I personally love for its flexible syntax and interoperability. But Pandas has well-known drawbacks such as memory inefficiency, inconsistent missing-data handling, and lacking multicore support. Multiple open-source projects aim to solve those issues; the most interesting is Polars.

Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks, and it evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations of long-time Pandas lovers regarding query-language flexibility?

In this talk, I will explain how Polars can be that fast and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!)

PyData: Data Handling
Kuppelsaal
15:10
30min
Staying Alert: How to Implement Continuous Testing for Machine Learning Models
Emeli Dral

Proper monitoring of machine learning models in production is essential to avoid performance issues. Setting up monitoring can be easy for a single model, but it often becomes challenging at scale or when you face alert fatigue based on many metrics and dashboards.

In this talk, I will introduce the concept of test-based ML monitoring. I will explore how to prioritize metrics based on risks and model use cases, integrate checks into the prediction pipeline, and standardize them across similar models and the model lifecycle. I will also take an in-depth look at batch model monitoring architecture and the use of open-source tools for setup and analysis.

PyCon: DevOps & MLOps
A1
15:10
30min
The CPU in your browser: WebAssembly demystified
Antonio Cuni

In recent years we have seen an explosion of usage of Python in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly, which is essentially a virtual CPU inside the browser.

General: Python & PyData Friends
B07-B08
15:45
15:45
30min
A concrete guide to time-series databases with Python
Ellen König, Heiner Tholen

We evaluated time-series databases and complementary services to stream-process sensor data. In this talk, we will present our evaluation and show the final implementation, alongside the Python tools we've built and the lessons learned during the process.

Sponsor
B09
15:45
30min
Driving down the Memray lane - Profiling your data science work
Cheuk Ting Ho

When handling a large amount of data, memory profiling of the data science workflow becomes more important: it gives you insight into which processes consume the most memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.

PyData: Jupyter
A1
15:45
30min
Have your cake and eat it too: Rapid model development and stable, high-performance deployments
Christian Bourjau, Jakub Bachurski

At the boundary of model development and MLOps lies the balance between the speed of deploying new models and operational constraints. These include factors like low-latency prediction, the absence of vulnerabilities in dependencies, and the need for model behavior to stay reproducible for years. The longer the list of constraints, the longer it usually takes to move a model from its development environment into production. In this talk, we present how we managed to square the circle and have both rapid, highly dynamic model development and a stable, high-performance deployment.

Sponsor
B07-B08
15:45
45min
Performing Root Cause Analysis with DoWhy, a Causal Machine-Learning Library
Patrick Blöbaum

In this talk, we will introduce the audience to DoWhy, a library for causal machine-learning (ML). We will introduce typical problems where causal ML can be applied and will specifically do a deep dive on root cause analysis using DoWhy. To do this, we will lay out what typical problem spaces for causal ML look like, what kind of problems we're trying to solve, and then show how to use DoWhy's API to solve these problems. Expect to see a lot of code with a hands-on example. We will close this session by zooming out a bit and also talk about the PyWhy organization governing DoWhy.

PyData: Machine Learning & Stats
A03-A04
15:45
30min
Polars - make the switch to lightning-fast dataframes
Thomas Bierhance

In this talk, we will report on our experiences switching from Pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular Pandas library and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from Pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.

PyData: Data Handling
Kuppelsaal
15:45
30min
WALD: A Modern & Sustainable Analytics Stack
Florian Wilhelm

The name WALD-stack stems from the four technologies it is composed of: a cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source data integration engine Airbyte, the open-source full-stack BI platform Lightdash, and the open-source data transformation tool DBT.

Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find at waldstack.org.

PyData: Data Handling
B05-B06
16:20
16:20
30min
BHAD: Explainable unsupervised anomaly detection using Bayesian histograms
Alexander Vosseler

The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. I present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications where model interpretability is crucial. I study the predictive performance of the proposed BHAD algorithm against various state-of-the-art anomaly detection approaches using both simulated data and popular benchmark datasets for outlier detection. The reported results indicate that BHAD has very competitive predictive accuracy compared to other, more complex and computationally more expensive algorithms, while being explainable and fast.
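The core idea, scoring points by the rarity of their histogram bin, can be sketched in plain Python. Note that this toy version fixes the number of bins rather than treating it as a Bayesian model parameter, so it is only a rough analogue of BHAD, and `histogram_scores` is an illustrative name, not the library's API:

```python
import math

def histogram_scores(data, n_bins=10):
    """Score each point by the rarity of its histogram bin (higher = more anomalous)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant data

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for x in data:
        counts[bin_of(x)] += 1
    # Negative log relative frequency: rare bins give high scores
    return [-math.log(counts[bin_of(x)] / len(data)) for x in data]

data = [1.0] * 9 + [100.0]
scores = histogram_scores(data)
print(scores[-1] > scores[0])  # True: the outlier gets the highest score
```

The per-point score is directly explainable: it depends only on how many other observations fall into the same bin.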

PyData: Machine Learning & Stats
B05-B06
16:20
30min
Building a Personal Assistant With GPT and Haystack: How to Feed Facts to Large Language Models and Reduce Hallucination.
Mathis Lucka

Large Language Models (LLMs), like ChatGPT, have shown impressive performance on various tasks. But there are still unsolved issues with these models: they can be confidently wrong, and their knowledge becomes outdated. GPT also does not have any of the information that you have stored in your own data. In this talk, you'll learn how to use Haystack, an open source framework, to chain LLMs with other models and components to overcome these issues. We will build a practical application using these techniques. And you will walk away with a deeper understanding of how to use LLMs to build NLP products that work.

PyData: Natural Language Processing
B07-B08
16:20
30min
FastAPI and Celery: Building Reliable Web Applications with TDD
Avanindra Kumar Pandeya

In this talk, we will explore how to use the FastAPI web framework and the Celery task queue to build reliable and scalable web applications in a test-driven manner. We will start by setting up a testing environment and writing unit tests for the core functionality of our application. Next, we will use FastAPI to create an API that performs a long-running task. Finally, we will see how Celery can help us offload long-running tasks and improve the performance of our application.
By the end of this talk, attendees will have a strong understanding of TDD and how to apply it to their FastAPI and Celery projects, and will be able to write tests that ensure the reliability and maintainability of their code.

PyCon: Testing
Kuppelsaal
16:20
30min
How to build observability into a ML Platform
Alicia Bargar

As machine learning becomes more prevalent across nearly every business and industry, making sure that these technologies are working and delivering quality results is critical. In her talk, Alicia will discuss the importance of machine learning observability and why it should be a fundamental part of modern machine learning architectures. Not only does it ensure models are accurate, it also helps teams iterate and improve models more quickly. Alicia will dive into how Shopify has been prototyping ways to build observability into different parts of its machine learning platform. This talk will provide insights on how to track model performance, how to catch unexpected or erroneous behaviour, what types of behavior to look for in your data (e.g. drift, quality metrics) and in your model/predictions, and how observability could work with large language models and chat AIs.

Sponsor
B09
16:20
30min
Specifying behavior with Protocols, Typeclasses or Traits. Who wears it better (Python, Scala 3, Rust)?
Kolja Maier

In this talk, we will explore the use of Python's typing.Protocol, Scala's Typeclasses, and Rust's Traits.
They all offer a very powerful & elegant mechanism for abstracting over various concepts (such as Serialization) in a modular manner.
We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
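On the Python side, the serialization example might look like the following structural protocol. This is a minimal sketch with illustrative names (`Serializer`, `JsonPoint`), not the talk's actual material:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Serializer(Protocol):
    """Anything with a serialize() -> str method satisfies this protocol."""
    def serialize(self) -> str: ...

class JsonPoint:
    def __init__(self, x: int, y: int) -> None:
        self.x, self.y = x, y

    def serialize(self) -> str:
        return f'{{"x": {self.x}, "y": {self.y}}}'

def dump(obj: Serializer) -> str:
    # Structural typing: JsonPoint never declares that it implements Serializer
    return obj.serialize()

print(dump(JsonPoint(1, 2)))  # {"x": 1, "y": 2}
print(isinstance(JsonPoint(1, 2), Serializer))  # True, thanks to @runtime_checkable
```

Unlike Scala typeclasses or Rust traits, which require an explicit instance or `impl`, Python's protocols match purely on structure, which is one of the trade-offs the talk compares.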

PyCon: Python Language
A1
16:50
16:50
30min
Coffee Break
Kuppelsaal
16:50
30min
Coffee Break
B09
16:50
30min
Coffee Break
B07-B08
16:50
30min
Coffee Break
B05-B06
16:50
30min
Coffee Break
A1
16:50
30min
Coffee Break
A03-A04
16:50
30min
Coffee Break
A05-A06
17:20
17:20
60min
Lightning Talks
Kuppelsaal
09:00
09:00
15min
Announcements
Kuppelsaal
09:15
09:15
45min
Keynote - How Are We Managing? Data Teams Management IRL
Noa Tamir

The title “Data Scientist” has been in use for 15 years now, and we have been attending PyData conferences for over 10 years. The hype around data science and AI seems higher than ever before. But how are we managing?

Plenary
Kuppelsaal
10:00
10:00
30min
Coffee Break
Kuppelsaal
10:00
30min
Coffee Break
B09
10:00
30min
Coffee Break
B07-B08
10:00
30min
Coffee Break
B05-B06
10:00
30min
Coffee Break
A1
10:00
30min
Coffee Break
A03-A04
10:00
30min
Coffee Break
A05-A06
10:30
10:30
90min
Aspect-oriented Programming - Diving deep into Decorators
Mike Müller

The aspect-oriented programming paradigm can support the separation of cross-cutting concerns such as logging, caching, or checking of permissions. This can improve code modularity and maintainability. Python offers decorators to implement reusable code for cross-cutting tasks.

This tutorial is an in-depth introduction to decorators. It covers the usage of decorators and how to implement simple and more advanced decorators. Use cases demonstrate how to work with decorators. In addition to showing how functions can use closures to create decorators, the tutorial introduces callable class instances as an alternative. Class decorators can solve problems that used to be tasks for metaclasses. The tutorial provides use cases for class decorators.

While the focus is on best practices and practical applications, the tutorial also provides deeper insight into how Python works behind the scenes. After the tutorial, participants will feel comfortable with functions that take functions and return new functions.
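The closure-based pattern at the heart of the tutorial can be previewed in a few lines: a minimal logging decorator using only the standard library (the names here are illustrative, not from the tutorial materials):

```python
import functools

def log_calls(func):
    """A reusable cross-cutting concern: log every call to func."""
    @functools.wraps(func)          # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}{args}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

add(2, 3)                # prints: calling add(2, 3)
print(add.__name__)      # 'add', not 'wrapper', thanks to functools.wraps
```

The `@log_calls` line is just syntactic sugar for `add = log_calls(add)`, which is why the tutorial's framing of "functions that take functions and return new functions" covers everything decorators do.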

PyCon: Programming & Software Engineering
A05-A06
10:30
30min
Bayesian Marketing Science: Solving Marketing's 3 Biggest Problems
Dr. Thomas Wiecki

In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value.
In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be grounded in a real-world case study with many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined to make optimal marketing budget decisions in complex scenarios.

PyData: Machine Learning & Stats
A1
10:30
90min
Geospatial Data Processing with Python: A Comprehensive Tutorial
Martin Christen

In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.

PyData: PyData & Scientific Libraries Stack
A03-A04
10:30
30min
Improving Machine Learning from Human Feedback
Erin Mikail Staples, Nikolai

Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models”. This raises the question: is unsupervised learning the best future for machine learning?

ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.

PyData: Machine Learning & Stats
B07-B08
10:30
30min
Software Design Pattern for Data Science
Theodore Meynard

Even if every data science work is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.

PyCon: DevOps & MLOps
B09
10:30
30min
The State of Production Machine Learning in 2023
Alejandro Saucedo

As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it is critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, covering the concepts that make production machine learning so challenging as well as some of the recommended tools available to tackle these challenges.

This talk will cover key principles, patterns, and frameworks around the open-source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, monitoring, etc. We will give a high-level overview of the production ML ecosystem and dive into best practices abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor, and scale production machine learning systems.

PyCon: DevOps & MLOps
B05-B06
10:30
30min
What could possibly go wrong? - An incomplete guide on how to prevent, detect & mitigate biases in data products
Lea Petters

In this talk, I want to look at the topic of data ethics with a practical lens and facilitate the discussion about how we can establish ethical data practices in our day-to-day work. I will shed some light on the multiple sources of biases in data applications: Where are the potential pitfalls, and how can we prevent, detect, and mitigate them early so they never become a risk for our data product? I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools, and libraries that can support our work. Being well aware that there is no universal solution, as tools and strategies need to be chosen to specifically address the requirements of the use case and models at hand, my talk will provide a good starting point for your own data ethics journey.

General: Ethics & Privacy
Kuppelsaal
11:05
11:05
30min
Actionable Machine Learning in the Browser with PyScript
Valerio Maggio

PyScript brings the full PyData stack into the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) zero-installation, serverless environment.

In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. built-in JavaScript integration and local modules), discussing the new data and design patterns (e.g. loading heterogeneous data in the browser) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).

PyData: Machine Learning & Stats
B07-B08
11:05
30min
BLE and Python: How to build a simple BLE project on Linux with Python
Bruno Vollmer

Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects.

However, BLE is not the most developer-friendly protocol, and these devices often come without good documentation. In addition, there are not many good open-source tools, examples, or tutorials on how to use Python with BLE, especially if one wants to build both sides of the communication.

In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python.

This talk is for everybody who has a basic understanding of Python and wants a deeper understanding of how BLE works and how to use it in a private project.

PyCon: Programming & Software Engineering
A1
11:05
30min
How Chatbots work – We need to talk!
Yuqiong Weng, Katrin Reininger

Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results.

With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely.

In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.

PyData: Natural Language Processing
B05-B06
11:05
30min
Rusty Python: A Case Study
Robin Raymond

Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs.

In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics:

  • An overview of Rust and its benefits for Python developers
  • Profiling and identifying performance bottlenecks in a Python application
  • Implementing a solution in Rust and integrating it with the Python application using PyO3
  • Measuring the performance improvements and comparing them to other optimization techniques

Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
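The profiling step mentioned in the outline can be done with Python's standard library before any Rust is written; a minimal sketch using cProfile, with a hypothetical hot function standing in for a real bottleneck:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately unvectorized hot loop, standing in for a real bottleneck.
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the most expensive calls; in a real project this is what points
# at the function worth porting to Rust.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Only once the profile confirms where the time goes does it make sense to rewrite that function with PyO3.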

PyCon: Programming & Software Engineering
Kuppelsaal
11:05
30min
“Who is an NLP expert?” - Lessons Learned from building an in-house QA-system
Nico Kreiling, Alina Bickel

Innovations such as sentence-transformers, neural search and vector databases have fueled very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a Slack bot that answers our questions by reading through our internal documents and Slack conversations. We therefore leveraged the Haystack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models.
This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.

PyData: Natural Language Processing
B09
11:40
11:40
30min
5 Things about fastAPI I wish we had known beforehand
Alexander CS Hendorf

An exchange of views on fastAPI in practice.

FastAPI is great: it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation.

FastAPI does a great job of getting people started with APIs quickly.

This talk will point out some obstacles and dark spots that I wish we had known about beforehand, and highlight solutions for them.

PyCon: Libraries
Kuppelsaal
11:40
30min
How Python enables future computer chips
Tim Hoffmann

At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts.

In this talk, we tell the story of how Python became a central tool for our R&D activities, covering technical aspects as well as organization and culture. How do you make sure that hundreds of people work in consistent environments? How do you get everyone on board to work together with Python? And when you have lots of domain experts without much software background, how do you prevent them from creating a mess as projects grow larger?

Sponsor
B07-B08
11:40
30min
Maps with Django
Paolo Melchiorre

Keeping in mind the Pythonic principle that “simple is better than complex” we'll see how to create a web map with the Python based web framework Django using its GeoDjango module, storing geographic data in your local database on which to run geospatial queries.

PyCon: Django
A1
11:40
30min
Observability for Distributed Computing with Dask
Hendrik Makait

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.

PyData: PyData & Scientific Libraries Stack
B09
11:40
30min
Using transformers – a drama in 512 tokens
Marianne Stecklina

“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever

As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets; many of us have to deal with longer input sequences. What started as a minor design choice for BERT got cemented by the research community over the years and now turns out to be my biggest headache: the 512-token limit.

In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers:

  1. How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too.

  2. I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that.

  3. Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research.

  4. So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
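One of the mediocre workarounds alluded to in point 4 is a sliding window over the token sequence: split the input into overlapping chunks that each fit the model's limit. A minimal sketch (token IDs are plain integers here; a real tokenizer would produce them):

```python
def chunk_tokens(token_ids, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows of max_len.

    The overlap (max_len - stride) gives each chunk some left context,
    so predictions near chunk borders are less likely to be cut off.
    """
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks


# A 1300-token document becomes five overlapping windows of at most 512.
chunks = chunk_tokens(list(range(1300)))
print([len(c) for c in chunks])
```

Each chunk is then fed to the model separately and the per-chunk results have to be aggregated, which is exactly where this workaround starts feeling mediocre.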

PyData: Natural Language Processing
B05-B06
12:10
12:10
60min
Lunch
Kuppelsaal
12:10
60min
Lunch
B09
12:10
60min
Lunch
B07-B08
12:10
60min
Lunch
B05-B06
12:10
60min
Lunch
A1
12:10
60min
Lunch
A03-A04
12:10
60min
Lunch
A05-A06
13:10
13:10
5min
Announcements
Kuppelsaal
13:15
13:15
45min
Keynote - Towards Learned Database Systems
Carsten Binnig

Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train model

Plenary
Kuppelsaal
14:05
14:05
90min
Data Kata: Ensemble programming with Pydantic #1
Lev Konstantinovskiy, Nitsan Avni, Gregor Riegler

Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
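To give a flavour of the kind of data-validation problem an ensemble might tackle, here is a sketch using only the standard library's dataclasses; the workshop itself uses Pydantic, which expresses the same checks declaratively through field types and validators. The `Measurement` model and its bounds are hypothetical, not the workshop's actual exercise:

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    sensor_id: str
    value: float

    def __post_init__(self):
        # Hand-rolled checks that Pydantic would express as field types
        # and validators.
        if not self.sensor_id:
            raise ValueError("sensor_id must be non-empty")
        self.value = float(self.value)  # coerce, as Pydantic would
        if not -50.0 <= self.value <= 150.0:
            raise ValueError("value out of plausible range")


ok = Measurement(sensor_id="s1", value="21.5")  # string is coerced to float
print(ok.value)  # 21.5
```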

PyCon: Programming & Software Engineering
A05-A06
14:05
90min
Let's contribute to pandas (3 hours) #1
Noa Tamir, Patrick Hoefler

PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.

pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!

If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .

PyData: PyData & Scientific Libraries Stack
A03-A04
14:10
14:10
30min
Data-driven design for the Dask scheduler
Guido Imperiale

Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!

PyCon: Programming & Software Engineering
B07-B08
14:10
30min
Getting started with JAX
Simon Pressler

DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are features such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, making your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.

PyData: Deep Learning
Kuppelsaal
14:10
30min
Methods for Text Style Transfer: Text Detoxification Case
Daryna Dementieva

Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done in the direction of offensive speech detection. However, there is another, more proactive way to fight toxic speech: suggesting to the user a detoxified version of their message. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, in both monolingual and multilingual use cases.
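As a baseline intuition for the task (not the neural style-transfer methods the talk covers), detoxification can be framed as rewriting a sentence while preserving its content. A toy dictionary-based rewriter, with an entirely made-up word list:

```python
# Toy lexicon; a real detoxification system learns rewrites from data
# instead of enumerating them by hand.
DETOX_MAP = {
    "stupid": "questionable",
    "idiotic": "unreasonable",
}


def naive_detoxify(text):
    """Replace listed toxic words; punctuation attached to a replaced
    word is dropped in this toy version."""
    words = []
    for word in text.split():
        stripped = word.strip(".,!?").lower()
        words.append(DETOX_MAP.get(stripped, word))
    return " ".join(words)


print(naive_detoxify("This plan is stupid!"))  # This plan is questionable
```

A word list obviously cannot preserve meaning or fluency in general, which is why the task is treated as learned text style transfer.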

PyData: Natural Language Processing
A1
14:10
30min
Pragmatic ways of using Rust in your data project
Christopher Prohm

Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in Numpy, Pandas, or the like. However, what to do, when the processing task does not fit these libraries? Using plain Python for processing can result in lacking performance, in particular when handling large data sets.

Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk, I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.

PyData: Data Handling
B05-B06
14:10
30min
You are what you read: Building a personal internet front-page with spaCy and Prodigy
Victoria Slocum

Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.

PyData: Natural Language Processing
B09
14:45
14:45
45min
Accelerating Public Consultations with Large Language Models: A Case Study from the UK Planning Inspectorate
Michele Dallachiesa, Andreas Leed

Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis.

We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%.

In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.

PyData: Natural Language Processing
A1
14:45
45min
Delivering AI at Scale
Anna Achenbach, Severin Schmitt, Thorsten Kranz

Everybody knows our yellow vans, trucks and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a data-driven company.
• Large-scale use cases: challenging, high-impact use cases in all major areas of logistics, including computer vision and NLP
• Fancy algorithms: deep neural networks, TSP solvers, and the standard toolkit of a data scientist
• Modern tooling: cloud platforms, Kubernetes, Kubeflow, AutoML
• No rusty working mode: small, self-organized, agile project teams combining state-of-the-art machine learning with MLOps best practices
• A young, motivated and international team – German skills are only “nice to have”
But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life, and share our approach to a timeseries forecasting library, combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.

Sponsor
B09
14:45
45min
Visualizing your computer vision data is not a luxury, it's a necessity: without it, your models are blind and so are you.
Arnault Chazareix

Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk!

Data visualization is a crucial ingredient for the success of any computer vision project.
It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders.

In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects.

Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.

PyData: Computer Vision
B07-B08
14:45
45min
When A/B testing isn’t an option: an introduction to quasi-experimental methods
Inga Janczuk

Identifying causal relationships by running experiments is not always possible. In this talk, I discuss an alternative approach: quasi-experimental frameworks. Additionally, I will present how to adjust well-known machine-learning algorithms so they can be used to quantify causal relationships.
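One classic quasi-experimental design is difference-in-differences: compare the before/after change of a treated group with that of an untreated control group, using the control's change as the counterfactual trend. A minimal sketch with made-up numbers (this is a standard textbook method, not necessarily the one the talk focuses on):

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """DiD estimate: the treated group's change minus the control's change.

    The control group's change stands in for the trend the treated group
    would have followed without treatment (the parallel-trends assumption).
    """
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical average outcomes, e.g. weekly orders per user.
effect = difference_in_differences(
    treated_before=10.0, treated_after=14.0,
    control_before=9.0, control_after=10.5,
)
print(effect)  # (14 - 10) - (10.5 - 9) = 2.5
```

The estimate is only credible if the parallel-trends assumption holds, which is exactly the kind of caveat that distinguishes quasi-experiments from A/B tests.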

PyData: Machine Learning & Stats
Kuppelsaal
14:45
45min
Writing Plugin Friendly Python Applications
Travis Hathaway

In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a way that is plugin friendly, it encourages the use of modular organization where the contracts between the core software and the plugin have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly.

Throughout the talk we will be creating our own plugin friendly application using the pluggy library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager conda is currently making its 10-year-old code more plugin friendly, to illustrate how to retrofit an existing project.
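The contract between core software and plugin can be sketched without pluggy itself: the host defines a hook name and the arguments it promises to pass, plugins register implementations, and the host calls all of them. This bare-bones registry (hypothetical names, not the talk's code) shows the idea that pluggy formalizes with hook specifications and validation:

```python
class PluginManager:
    """A bare-bones plugin registry. The 'contract' is the hook name
    plus the keyword arguments the host promises to pass."""

    def __init__(self):
        self._hooks = {}

    def register(self, hook_name, func):
        self._hooks.setdefault(hook_name, []).append(func)

    def call(self, hook_name, **kwargs):
        # Call every registered implementation and collect the results.
        return [func(**kwargs) for func in self._hooks.get(hook_name, [])]


pm = PluginManager()

# Plugins extend the application by registering against a known hook.
pm.register("file_loaded", lambda path: f"plugin A saw {path}")
pm.register("file_loaded", lambda path: f"plugin B saw {path}")

print(pm.call("file_loaded", path="data.csv"))
```

Thinking this contract through up front is what makes the core application modular: new behavior arrives via `register`, never via edits to the host.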

PyCon: Programming & Software Engineering
B05-B06
15:30
15:30
30min
Coffee Break
Kuppelsaal
15:30
30min
Coffee Break
B09
15:30
30min
Coffee Break
B07-B08
15:30
30min
Coffee Break
B05-B06
15:30
30min
Coffee Break
A1
15:45
15:45
90min
Data Kata: Ensemble programming with Pydantic #2
Lev Konstantinovskiy, Gregor Riegler, Nitsan Avni

Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.

PyCon: Programming & Software Engineering
A05-A06
15:45
90min
Let's contribute to pandas (3 hours) #2
Noa Tamir, Patrick Hoefler

PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.

pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!

If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .

PyData: PyData & Scientific Libraries Stack
A03-A04
16:00
16:00
30min
Enabling Machine Learning: How to Optimize Infrastructure, Tools and Teams for ML Workflows
Yann Lemonnier

In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.

Sponsor
B09
16:00
30min
Introducing FastKafka
Tvrtko Sternak

FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka.

The documentation of the library can be found here: https://fastkafka.airt.ai/

PyCon: Libraries
B07-B08
16:00
30min
MLOps in practice: our journey from batch to real-time inference
Theodore Meynard

I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.

PyCon: DevOps & MLOps
A1
16:00
60min
PyLadies Panel Session. Tech Illusions and the Unbalanced Society: Finding Solutions for a Better Future

During this panel, we’ll discuss the significant role PyLadies chapters around the world have played in advocating for gender representation and leadership and combating biases and the gender pay gap.

General: Python & PyData Friends
Kuppelsaal
16:00
30min
The bumps in the road: A retrospective on my data visualisation mistakes
Artem Kislovskiy

We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.

PyData: Visualisation
B05-B06
16:35
16:35
30min
Ask-A-Question: an FAQ-answering service for when there's little to no data
Suzin You

Doing data science in international development often means finding the right-sized solution in resource-constrained settings.

This talk walks you through how my team helped answer thousands of questions from pregnant folks and new parents on a South African maternal and child health helpline, which model we ended up choosing and why (hint: resource constraints!), and how we've packaged everything into a service that anyone can start for themselves.

By the end of the talk, I hope you'll know how to start your own FAQ-answering service and learn about one example of doing data science in international development.
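In a resource-constrained setting, an FAQ matcher can start as simply as scoring word overlap between the incoming question and each canonical FAQ. The talk discusses the model actually chosen; this is only a hypothetical baseline with made-up FAQs:

```python
def best_faq(question, faqs):
    """Return the FAQ whose words overlap most with the question."""
    q_words = set(question.lower().split())
    scored = []
    for faq in faqs:
        overlap = len(q_words & set(faq.lower().split()))
        scored.append((overlap, faq))
    return max(scored)[1]


faqs = [
    "when should my baby get vaccinated",
    "what foods are safe during pregnancy",
]
print(best_faq("which foods are safe while pregnant", faqs))
```

Such a bag-of-words baseline fails on paraphrases ("pregnant" vs "pregnancy" score zero overlap here), which is the gap that embedding-based matching closes.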

PyData: Natural Language Processing
A1
16:35
30min
Neo4j graph databases for climate policy
Marcus Tedesco

In this talk we walkthrough our experience using Neo4j and Python to model climate policy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!

PyData: Data Handling
B07-B08
16:35
30min
Use Spark from anywhere: A Spark client in Python powered by Spark Connect
Martin Grund

Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: Today, every application, from web services that run in application servers, interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, want to leverage the power of data.

However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.

This talk highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.

Sponsor
B09
17:00
17:00
30min
Coffee Break
Kuppelsaal
17:05
17:05
25min
Coffee Break
B09
17:05
25min
Coffee Break
B07-B08
17:05
25min
Coffee Break
B05-B06
17:05
25min
Coffee Break
A1
17:30
17:30
60min
Lightning Talks 60Min
Kuppelsaal
17:30
60min
PyLadies Workshop

A workshop for PyLadies members with the Berlin Tech Workers Council discussing the legal frameworks on contracts and termination agreements, as well as how employees can defend themselves in situations where they are made redundant due to mass layoffs.

General: Others
B05-B06
18:25
18:25
120min
Social Gathering @BCC
Kuppelsaal
09:00
09:00
10min
Announcements
Kuppelsaal
09:10
09:10
45min
Keynote - Lorem ipsum dolor sit amet
Miroslav Šedivý

A life without joy is like software without meaningful test data - it's uncertain and unreliable. The search for the perfect test data is a challenge. Real data should not be too real. Random data should not be too random. This is a randomly real and a really random journey to discover the balance between these two, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Plenary
Kuppelsaal
10:00
10:00
45min
Accelerating Python Code
Jens Nie

Python is a beautiful language for fast prototyping and sketching ideas quickly. People often struggle to get their code into production, though, for various reasons. Besides the security and safety concerns that usually are not addressed from the very beginning when playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking Python code to the next level.

We will look at the "missing performance" worries using a simple numerical problem and how to speed the corresponding Python code up to top notch performance.

PyCon: Python Language
Kuppelsaal
10:00
45min
Advanced Visual Search Engine with Self-Supervised Learning (SSL) Representations and Milvus
Antoine Toubhans, Noé Achache

Image retrieval is the process of searching for images in a large database that are similar to one or more query images. A classical approach is to transform the database images and the query images into embeddings via a feature extractor (e.g., a CNN or a ViT), so that they can be compared via a distance metric. Self-supervised learning (SSL) can be used to train a feature extractor without the need for expensive and time-consuming labeled training data. We will use DINO's SSL method to build a feature extractor and Milvus, an open-source vector database built for scalable similarity search, to index image representation vectors for efficient retrieval. We will compare the SSL approach with supervised and pre-trained feature extractors.
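Independent of how the embeddings are produced (DINO in this talk), retrieval reduces to nearest-neighbour search in the embedding space. A minimal cosine-similarity sketch with made-up 3-dimensional vectors; Milvus does the same thing at scale with approximate indexes:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query, database, top_k=2):
    """Rank database embeddings by similarity to the query embedding."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Tiny toy "index": filename -> embedding.
database = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}
print(retrieve([1.0, 0.0, 0.0], database))
```

The exhaustive scan above is O(n) per query; a vector database replaces it with an index so the same search stays fast over millions of images.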

PyData: Computer Vision
B05-B06
10:00
90min
Building Hexagonal Python Services
Shahriyar Rzayev

The importance of enterprise architecture patterns is well known, and they are applicable to varied types of tasks. Thinking about architecture from the beginning of the journey is crucial to having a maintainable, and therefore testable and flexible, code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app using the Repository, Unit of Work, and Services (Use Cases) patterns, tied together with Dependency Injection. All these patterns are quite famous in other languages, but they are relatively new to the Python ecosystem, where they are a crucial missing piece.
As a web framework, we are going to use FastAPI, which can be swapped for any other framework with little effort thanks to the abstractions we have added.
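The shape of the pattern can be sketched in a few lines: the service depends only on an abstract port, and a concrete adapter (here in-memory; in production, a database) is injected from outside. The names below are hypothetical, not the talk's actual code:

```python
from abc import ABC, abstractmethod


class UserRepository(ABC):
    """Port: the service layer only ever sees this interface."""

    @abstractmethod
    def add(self, user_id, name): ...

    @abstractmethod
    def get(self, user_id): ...


class InMemoryUserRepository(UserRepository):
    """Adapter: swap for a SQL-backed one without touching the service."""

    def __init__(self):
        self._users = {}

    def add(self, user_id, name):
        self._users[user_id] = name

    def get(self, user_id):
        return self._users.get(user_id)


class RegisterUserService:
    """Use case: business rules, free of framework and storage details."""

    def __init__(self, repo: UserRepository):
        self.repo = repo  # dependency injection of the port

    def register(self, user_id, name):
        if self.repo.get(user_id) is not None:
            raise ValueError("user already exists")
        self.repo.add(user_id, name)


service = RegisterUserService(InMemoryUserRepository())
service.register("u1", "Ada")
print(service.repo.get("u1"))  # Ada
```

Because the FastAPI layer would only ever call `RegisterUserService`, both the web framework and the storage backend become replaceable edges of the hexagon.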

PyCon: Programming & Software Engineering
A05-A06
10:00
90min
Create interactive Jupyter websites with JupyterLite
Jeremy Tuloup

Jupyter notebooks are a popular tool for data science and scientific computing, allowing users to mix code, text, and multimedia in a single document. However, sharing Jupyter notebooks can be challenging, as they require installing a specific software environment to be viewed and executed.

JupyterLite is a Jupyter distribution that runs entirely in the web browser without any server components. A significant benefit of this approach is the ease of deployment. With JupyterLite, the only requirement to provide a live computing environment is a collection of static assets. In this talk, we will show how you can create such a static website and deploy it to your users.

PyData: Jupyter
A03-A04
10:00
45min
Monorepos with Python
AbdealiLoKo

Working with Python is fun.
Managing Python packaging, linters, tests, CI, etc. is not as fun.

Every maintainer needs to worry about consistent styling, quality, speed of tests, etc. as the project grows.

Monorepos have been successful in other communities - how do they work in Python?

PyCon: Programming & Software Engineering
B07-B08
10:00
45min
The Spark of Big Data: An Introduction to Apache Spark
Pasha Finkelshteyn

Get ready to level up your big data processing skills! Join us for an introductory talk on Apache
Spark, the distributed computing system used by tech giants like Netflix and Amazon. We'll
cover PySpark DataFrames and how to use them. Whether you're a Python developer new to
big data or looking to explore new technologies, this talk is for you. You'll gain foundational
knowledge about Apache Spark and its capabilities, and learn how to leverage DataFrames and
SQL APIs to efficiently process large amounts of data. Don't miss out on this opportunity to up
your big data game!

Sponsor
B09
10:00
45min
Why GPU Clusters Don't Need to Go Brrr? Leverage Compound Sparsity to Achieve the Fastest Inference Performance on CPUs
Damian Bogunowicz

Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution.
This talk will demonstrate the power of compound sparsity for model compression and inference speedup for NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured + unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. The session participants will learn the theory behind compound sparsity, state-of-the-art techniques, and how to apply it in practice using the Neural Magic platform.

PyData: Deep Learning
A1
10:50
10:50
30min
Haystack for climate Q/A
Vibha Vikram Rao

How can NLP and Haystack help answer sustainability questions and fight climate change? In this talk we walk through our experience using Haystack to build question-answering models for the climate change and sustainability domain. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!

PyData: Natural Language Processing
A1
10:50
30min
Shrinking gigabyte sized scikit-learn models for deployment
Pavel Zwerschke, Yasin Tatar

We present an open-source library to shrink pickled scikit-learn and LightGBM models. We will provide insights into how pickling ML models works and how to improve the on-disk representation. With this approach, we can reduce the deployment size of machine learning applications by up to 6x.

PyData: PyData & Scientific Libraries Stack
B09
10:50
30min
Teaching Neural Networks a Sense of Geometry
Jens Agerberg

By taking neural networks back to the school bench and teaching them some elements of geometry and topology we can build algorithms that can reason about the shape of data. Surprisingly these methods can be useful not only for computer vision – to model input data such as images or point clouds through global, robust properties – but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA) which we will introduce and review recent works at its intersection with machine learning. TDA can be seen as being part of the increasingly popular movement of Geometric Deep Learning which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.

PyData: Deep Learning
B05-B06
10:50
30min
Thou Shall Judge But With Fairness: Methods to Ensure an Unbiased Model
Nandana Sreeraj

Is your model prejudicial? Is your model deviating from the predictions it ought to have made? Has your model misunderstood the concept? In the world of artificial intelligence and machine learning, the word "fairness" is particularly common. It is described as having the quality of being impartial or fair. Fairness in ML is essential for contemporary businesses. It helps build consumer confidence and demonstrates to customers that their issues are important. Additionally, it aids in ensuring adherence to guidelines established by authorities, thus guaranteeing that the idea of responsible AI is upheld. In this talk, let's explore how certain sensitive features are influencing the model and introducing bias into it. We'll also look at how we can make it better.

General: Ethics & Privacy
B07-B08
10:50
30min
Unlocking Information - Creating Synthetic Data for Open Access.
Antonia Scherz

Many good project ideas fail before they even start due to the sensitive personal data required. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without recreating an exact copy of the actual data.
This talk will give hands-on insights into synthetic data creation and challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will find answers to why it takes so long to synthesize the huge amount of data dormant in public administration. The talk addresses owners who want to create access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use and its limitations for real-world analyses.

PyData: PyData & Scientific Libraries Stack
Kuppelsaal
11:20
11:20
30min
Coffee Break
Kuppelsaal
11:20
30min
Coffee Break
B09
11:20
30min
Coffee Break
B07-B08
11:20
30min
Coffee Break
B05-B06
11:20
30min
Coffee Break
A1
11:40
11:40
90min
Most of you don't need Spark. Large-scale data management on a budget with Python
Guillem Borrell

The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely only on large batch processes executed in a Spark cluster. But as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is way more powerful than what you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.

PyData: Data Handling
A03-A04
11:40
90min
Workshop on Privilege and Ethics in Data
Tereza Iofciu, Paula Gonzalez Avalos

Data-driven products are becoming more and more ubiquitous. Humans build data-driven products. Humans are intrinsically biased. This bias goes into the data-driven products, confirming and amplifying the original bias. In this tutorial, you will learn how to identify your own -often unperceived- biases and reflect on and discuss the consequences of unchecked biases in Data Products.

General: Ethics & Privacy
A05-A06
11:50
11:50
30min
Grokking Anchors: Uncovering What a Machine-Learning Model Relies On
Kilian Kluge

Assessing the robustness of models is an essential step in developing machine-learning systems. To determine if a model is sound, it often helps to know which and how many input features its output hinges on. This talk introduces the fundamentals of “anchor” explanations that aim to provide that information.

PyData: Machine Learning & Stats
A1
11:50
30min
Modern typed python: dive into a mature ecosystem from web dev to machine learning
samsja

Typing is at the center of "modern Python", and tools (mypy, beartype) and libraries (FastAPI, SQLModel, Pydantic, DocArray) based on it are slowly eating the Python world.

This talk explores the benefits of Python type hints and shows how they are infiltrating the next big domain: machine learning.

PyCon: Python Language
Kuppelsaal
11:50
30min
Prompt Engineering 101: Beginner intro to LangChain, the shovel of our ChatGPT gold rush
Lev Konstantinovskiy

"A modern AI start-up is a front-end developer plus a prompt engineer" is a popular joke on Twitter.
This talk is about LangChain, a Python open-source tool for prompt engineering. You can use it with completely open-source language models or ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent using web search and generating Python code to answer questions about this conference.

PyData: Natural Language Processing
B07-B08
11:50
30min
The future of the Jupyter Notebook interface
Jeremy Tuloup

Jupyter Notebooks have been a widely popular tool for data science in recent years due to their ability to combine code, text, and visualizations in a single document.

Despite its popularity, the core functionality and user experience of the Classic Jupyter Notebook interface has remained largely unchanged over the past years.

Lately the Jupyter Notebook project decided to base its next major version 7 on JupyterLab components and extensions, which means many JupyterLab features are also available to Jupyter Notebook users.

In this presentation, we will demo the new features coming in Jupyter Notebook version 7 and how they are relevant to existing users of the Classic Notebook.

PyData: Jupyter
B05-B06
11:50
30min
What are you yield from?
Maxim Danilov

Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. Generators themselves are slower than plain list loops, yet using them can greatly increase the speed of an application. Let's discover why.
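
A minimal sketch of the trade-off (my illustration, not the speaker's examples): a generator yields values lazily, one at a time, so it avoids materializing large intermediate lists, and `yield from` lets one generator delegate to others.

```python
import sys

def squares_list(n):
    return [i * i for i in range(n)]

def squares_gen(n):
    for i in range(n):
        yield i * i  # produced one at a time, on demand

def chained(*iterables):
    for it in iterables:
        yield from it  # delegate iteration to each sub-iterator

# Same values either way...
assert sum(squares_list(1000)) == sum(squares_gen(1000))

# ...but the list materializes every element up front, while the
# generator object stays tiny no matter how large n is.
print(sys.getsizeof(squares_list(100_000)), sys.getsizeof(squares_gen(100_000)))
```

For pipelines over large or streaming data, that constant memory footprint is what makes the whole application faster, even if each individual `next()` call costs more than a list index.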

PyCon: Python Language
B09
12:25
12:25
30min
Code Cleanup: A Data Scientist's Guide to Sparkling Code
Corrie Bartelheimer

Does your production code look like it’s been copied from Untitled12.ipynb? Are your engineers complaining about the code but you can’t find the time to work on improving the code base? This talk will go through some of the basics of clean coding and how to best implement them in a data science team.

PyCon: Programming & Software Engineering
Kuppelsaal
12:25
30min
Dynamic pricing at Flix
Amit Verma

In this talk we give a brief overview of how we use dynamic pricing to tune the prices for rides based on demand, time of purchase, unexpected events such as strikes, and other criteria to fulfil our business requirements.

Sponsor
B05-B06
12:25
30min
How to connect your application to the world (and avoid sleepless nights)
Luis Fernando Alvarez

Let’s say you are the ruler of a remote island. For it to succeed and thrive you can’t expect it to be isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We’re not going to talk about land masses or commerce, however, you should think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and is not always very consistent, similar to the networks you use to connect your application to the world.

We will explore some techniques and libraries in the Python ecosystem used to make your life easier while dealing with external services. From asynchronicity, caching, testing, and building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully, and avoid those pesky 2 AM errors that keep you awake.

PyCon: Programming & Software Engineering
B07-B08
12:25
30min
Maximizing Efficiency and Scalability in Open-Source MLOps: A Step-by-Step Approach
Paul Elvers

This talk presents a novel approach to MLOps that combines the benefits of open-source technologies with the power and cost-effectiveness of cloud computing platforms. By using tools such as Terraform, MLflow, and Feast, we demonstrate how to build a scalable and maintainable ML system on the cloud that is accessible to ML Engineers and Data Scientists. Our approach leverages cloud managed services for the entire ML lifecycle, reducing the complexity and overhead of maintenance and eliminating the vendor lock-in and additional costs associated with managed MLOps SaaS services. This innovative approach to MLOps allows organizations to take full advantage of the potential of machine learning while minimizing cost and complexity.

PyCon: DevOps & MLOps
A1
12:25
30min
Streamlit meets WebAssembly - stlite
Yuichiro Tachibana

Streamlit, a pure-Python data app framework, has been ported to Wasm as "stlite".
See its power and convenience with many live examples and explore its internals from a technical perspective.
You will learn to quickly create interactive in-browser apps using only Python.

PyCon: Web
B09
12:55
12:55
65min
Lunch
Kuppelsaal
12:55
65min
Lunch
B09
12:55
65min
Lunch
B07-B08
12:55
65min
Lunch
B05-B06
12:55
65min
Lunch
A1
13:10
13:10
50min
Lunch
A03-A04
13:10
50min
Lunch
A05-A06
14:00
14:00
30min
Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
Joris Van den Bossche

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing, and is becoming the de facto standard for tabular data. This talk will give an overview of recent developments both in Apache Arrow itself and in how it is being adopted across the PyData ecosystem (and beyond), and how it can improve your day-to-day data analytics workflows.

PyData: PyData & Scientific Libraries Stack
B05-B06
14:00
30min
Behind the Scenes of tox: The Journey of Rewriting a Python Tool with more than 10 Million Monthly Downloads
Jürgen Gmach

tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.

PyCon: Testing
Kuppelsaal
14:00
30min
Bringing NLP to Production (an end to end story about some multi-language NLP services)
Larissa Haas, Jonathan Brandt

Models in Natural Language Processing are fun to train but can be difficult to deploy. The size of their models, libraries and necessary files can be challenging, especially in a microservice environment. When services should be built as lightweight and slim as possible, large (language) models can lead to a lot of problems. With a recent real-world use case as an example, one that has been running in production for over a year and in 10 different languages, I will walk you through my experiences with deploying NLP models. What kinds of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production?

In this talk, you will learn about different ways and possibilities to deploy NLP services. I will speak briefly about the way leading from data to model and a running service (without going into much detail) before I will focus on the MLOps part in the end. I will take you with me on my past journey of struggles and successes so that you don’t need to take these detours by yourselves.

PyCon: DevOps & MLOps
B09
14:00
30min
Machine Learning Lifecycle for NLP Classification in E-Commerce
Gunar Maiwald, Tobias Senst

Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our solution of a machine learning lifecycle for the text-based cataloging classification system from idealo.de. We will share lessons learned and talk about our experiences during the lifecycle migration from a hosted cluster to a cloud solution within the last 3 years. In addition, we will outline how we embedded our ML components as part of the overall idealo.de processing architecture.

PyCon: DevOps & MLOps
A1
14:00
30min
You've got trust issues, we've got solutions: Differential Privacy
Sarthika Dhawan, Vikram Waradpande

As we are in an era of big data where large groups of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since a lot of private information, once leaked, can be misused, not all data can be released for research. This talk aims to discuss Differential Privacy, a cutting-edge technique of cybersecurity that claims to preserve an individual's privacy, how it is employed to minimize the risks with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
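
The core mechanism behind differential privacy can be shown in a few lines. This is a stdlib-only sketch of the classic Laplace mechanism, not PyDP's API: for a counting query (sensitivity 1), adding Laplace noise with scale 1/epsilon yields an epsilon-differentially-private release.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person
    changes the result by at most 1, so Laplace noise with scale
    1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
print(private_count(1000, epsilon=0.1, rng=rng))    # noisy: strong privacy
print(private_count(1000, epsilon=100.0, rng=rng))  # nearly exact: weak privacy
```

Smaller epsilon means more noise and stronger privacy; libraries like PyDP wrap this kind of mechanism (and more robust variants) behind a higher-level interface.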

PyData: PyData & Scientific Libraries Stack
B07-B08
14:10
14:10
90min
Contributing to an open-source content library for NLP
Leonard Püttmann

Bricks is an open-source content library for natural language processing, which provides the building blocks to quickly and easily enrich, transform or analyze text data for machine learning projects. For many Pythonistas, contributing to an open-source project seems scary and intimidating. In this tutorial, we offer a hands-on experience in which programmers and data scientists learn how to code their own building blocks and share their creations with the community with ease.

PyData: Natural Language Processing
A03-A04
14:10
90min
The Battle of Giants: Causality vs NLP => From Theory to Practice
Aleksander Molak

With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!

PyData: Natural Language Processing
A05-A06
14:35
14:35
30min
Cloud Infrastructure From Python Code: How Far Could We Go?
Asher Sterkin, Etzik Bega

Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.

General: Infrastructure - Hardware & Cloud
A1
14:35
30min
Giving and Receiving Great Feedback through PRs
David Andersson

Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs.

PyCon: Programming & Software Engineering
Kuppelsaal
14:35
30min
Introduction to Async programming
Dishant Sethi

Asynchronous programming is a form of concurrent programming in which a unit of work runs separately from the primary application thread and, on completion or failure, notifies the main thread of the result. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and more effective CPU usage.

Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in heavy-IO applications like websites, depends on external resources. This could be anything from a remote database query to a POST request to an API. As soon as you ask for any of these resources, your code is waiting around with nothing to do until they respond. With asynchronous programming, you allow your code to handle other tasks while waiting for these other resources to respond.

In this session, we are going to talk about asynchronous programming in Python. Its benefits and multiple ways to implement it.
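
As a minimal taste of the idea (a toy sketch, not the session's material): with `asyncio`, two simulated IO waits overlap instead of running back to back, so the total time is roughly that of the slowest one.

```python
import asyncio
import time

async def fetch(name, delay):
    """Stand-in for an IO-bound call (database query, HTTP request, ...)."""
    await asyncio.sleep(delay)  # the event loop runs other tasks meanwhile
    return name

async def main():
    start = time.perf_counter()
    # Both "requests" wait concurrently: total time is ~0.05s, not ~0.1s.
    results = await asyncio.gather(fetch("db", 0.05), fetch("api", 0.05))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.3f}s")
```

The same pattern scales to hundreds of concurrent requests on a single thread, which is where the responsiveness gains come from.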

PyCon: Programming & Software Engineering
B07-B08
14:35
30min
The Beauty of Zarr
Sanket Verma

In this talk, I’d be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. This talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I’d mainly talk about Zarr’s Python implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.

PyData: PyData & Scientific Libraries Stack
B05-B06
14:35
30min
evosax: JAX-Based Evolution Strategies
Robert Lange

Tired of having to handle asynchronous processes for neuroevolution? Do you want to leverage massive vectorization and high-throughput accelerators for evolution strategies (ES)? evosax allows you to leverage JAX, XLA compilation and auto-vectorization/parallelization to scale ES to your favorite accelerators. In this talk we will get to know the core API and how to solve distributed black-box optimization problems with evolution strategies.

PyData: Machine Learning & Stats
B09
15:10
15:10
30min
Fear the mutants. Love the mutants.
Max Kahan

Developers often use code coverage as a target, which makes it a bad measure of test quality.

Mutation testing changes the game: create mutant versions of your code that break your tests, and you'll quickly start to write better tests!

Come and learn to use it as part of your CI/CD process. I promise, you'll never look at penguins the same way again!
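
The core idea can be shown without any tooling (a hand-rolled sketch; real tools such as mutmut generate mutants automatically): a mutant is a tiny edit to the code under test, and a good test suite should fail on it.

```python
def price_with_discount(price, discount):
    return price * (1 - discount)

# A mutation testing tool would generate "mutants" by making small
# edits to the code under test, for example flipping an operator:
def price_with_discount_mutant(price, discount):
    return price * (1 + discount)  # mutant: '-' flipped to '+'

def weak_test(fn):
    # 100% line coverage, but it also passes for the mutant when
    # discount == 0 -- the mutant survives, so the test is too weak.
    return fn(100, 0) == 100

def strong_test(fn):
    # Kills the mutant: a non-zero discount exposes the flipped operator.
    return fn(100, 0.25) == 75

assert weak_test(price_with_discount) and weak_test(price_with_discount_mutant)
assert strong_test(price_with_discount)
assert not strong_test(price_with_discount_mutant)
```

A surviving mutant points at exactly the behaviour your tests never check, which is far more actionable than a coverage percentage.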

PyCon: Testing
B05-B06
15:10
30min
Great Security Is One Question Away
Wiktoria Dalach

After a decade of writing code, I joined the application security team. During the transition process, I discovered that there are many myths about security and about how difficult it is. Often devs choose to ignore it because they think that writing more secure code would take them ages. It is not true. Security doesn't have to be scary. From my talk, you will learn the most useful pieces of application security theory. It will be practical and not boring at all.

PyCon: Programming & Software Engineering
A1
15:10
30min
How to increase diversity in open source communities
Maren Westermann

Today state of the art technology and scientific research strongly depend on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. This situation creates problems not only for individual contributors outside of this demographic but also for open source projects such as loss of career opportunities and less robust technologies, respectively [1][7]. In recent years there have been a number of various recommendations and initiatives to increase the participation in open source projects of groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.

General: Community, Diversity, Career, Life and everything else
Kuppelsaal
15:10
30min
Postmodern Architecture: The Python Powered Modern Data Stack
John Sandall

The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.

Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").

This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.

In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.

PyData: Data Handling
B09
15:10
30min
Rethinking codes of conduct
Tereza Iofciu

Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the community felt existing codes were "unbalanced and not seeing the true spectrum of the greater community".
Why is that a big thing? Come to my talk and find out!

General: Community, Diversity, Career, Life and everything else
B07-B08
15:40
15:40
30min
Coffee Break
Kuppelsaal
15:40
30min
Coffee Break
B09
15:40
30min
Coffee Break
B07-B08
15:40
30min
Coffee Break
B05-B06
15:40
30min
Coffee Break
A1
15:40
30min
Coffee Break
A03-A04
15:40
30min
Coffee Break
A05-A06
16:10
16:10
30min
Closing Session
Kuppelsaal