Speeding up Python code has traditionally been achieved by writing C/C++ — an alien world for most Python users. Today, you can write high-performance code in Julia instead, which is much easier for Python users. This tutorial will give you hands-on experience writing a Python library that incorporates Julia for performance optimization.
The industrial environment offers a lot of interesting use cases for data enthusiasts. There is a myriad of challenges that can be solved by data scientists.
However, collecting industrial data in general, and industrial IoT (IIoT) data in particular, is cumbersome and not really appealing for anyone who just wants to work with data.
Apache StreamPipes addresses this pitfall and allows anyone to extract data from IIoT data sources without messing around with (old-fashioned) protocols. In addition, StreamPipes' newly developed Python client now gives Pythonistas the ability to programmatically access and work with this data in a Pythonic way.
This talk will provide a basic introduction to the functionality of Apache StreamPipes itself, followed by a deeper discussion of the Python client. Finally, a live demo will show how IIoT data can be easily retrieved in Python and used directly for visualization and ML model training.
What is an ML platform and do you even need one? When should you consider investing in your own ML platform? What challenges can you expect building and maintaining one? Tune in and discover (some) answers to these questions and more! I will share a first-hand account of our ongoing journey towards becoming an ML platform team within Delivery Hero's Logistics department, including how we got here, how we structure our work, and which challenges and tools we are focusing on next.
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production-grade. The good news is, there's finally a cure!
The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. It truly transforms messy notebooks into pipelines for orchestration frameworks such as Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it!
In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
When building PyTorch models for custom applications from scratch there's usually one problem: The model does not learn anything. In a complex project, it can be tricky to identify the cause: Is it the data? A bug in the model? Choosing the wrong loss function at 3 am after an 8-hour coding session?
In this talk, we will build a toolbox to find the culprits in a structured manner. We will focus on simple ways to ensure a training loop is correct, generate synthetic training data to determine whether we have a model bug or problematic real-world data, and leverage pytest to safely refactor PyTorch models.
After this talk, attendees will be well equipped to take the right steps when a model is not learning, quickly identify the underlying reasons, and prevent bugs in the future.
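As a taste of the toolbox (a minimal sketch, not the speakers' actual code; model, data and step count are made up), here is a pytest check that a few training steps on easy synthetic data actually reduce the loss, which quickly separates "bug in the training loop" from "problematic real-world data":

```python
import torch
from torch import nn


def make_synthetic_batch(n=64, dim=10):
    # Linearly separable toy data: if the model cannot fit this, the bug is in the code, not the data.
    x = torch.randn(n, dim)
    y = (x[:, 0] > 0).long()
    return x, y


def test_training_step_reduces_loss():
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    x, y = make_synthetic_batch()

    loss_before = loss_fn(model(x), y).item()
    for _ in range(20):  # a handful of steps is enough to detect "the model learns nothing"
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    loss_after = loss_fn(model(x), y).item()

    assert loss_after < loss_before, "training did not reduce the loss on trivial synthetic data"
```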
The materials presented during this tutorial are open source and can be used by coaches and tutors who want to teach their students how to use Python for text processing and text classification. A minimal understanding of programming (in any language) is required of the students.
Pandas reached its 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools.
This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features.
AutoML, or automated machine learning, offers the promise of transforming raw data into accurate predictions with minimal human intervention, expertise, and manual experimentation. In this talk, we will introduce AutoGluon, a cutting-edge toolkit that enables AutoML for tabular, multimodal and time series data. AutoGluon emphasizes usability, enabling a wide variety of tasks from regression to time series forecasting and image classification through a unified and intuitive API. We will specifically focus on tabular and time series tasks, where AutoGluon is the current state of the art, and demonstrate how AutoGluon can be used to achieve competitive performance on tabular and time series competition data sets. We will also discuss the techniques used to automatically build and train these models, peeking under the hood of AutoGluon.
In recent years, Hyperparameter Optimization (HPO) has become a fundamental step in the training of Machine Learning (ML) models and in the creation of automatic ML pipelines. Unfortunately, while HPO improves the predictive performance of the final model, it comes with a significant cost both in terms of computational resources and waiting time. This leads many practitioners to try to lower the cost of HPO by employing unreliable heuristics.
In this talk we will provide simple and practical algorithms for users who want to train models with almost-optimal predictive performance while incurring a significantly lower cost and waiting time. The presented algorithms are agnostic to the application and the model being trained, so they can be useful in a wide range of scenarios.
We provide results from extensive experiments on public benchmarks, including comparisons with well-known techniques such as Bayesian Optimization (BO), ASHA, and Successive Halving.
We will describe in which scenarios the biggest gains are observed (up to 30x) and provide examples for how to use these algorithms in a real-world environment.
All the code used for this talk is available on [GitHub](https://github.com/awslabs/syne-tune).
In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using an annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you own and control the model at runtime.
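A minimal sketch of the zero-shot annotation step described above (assuming the pre-1.0 `openai` Python client; the label set, prompt and model name are illustrative, not from the talk):

```python
import openai  # pre-1.0 client exposing openai.ChatCompletion

LABELS = ["positive", "negative", "neutral"]


def draft_annotation(text: str) -> str:
    """Ask the model for a zero-shot label; a human reviews and corrects it afterwards."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Classify the text into one of: {', '.join(LABELS)}. Answer with the label only."},
            {"role": "user", "content": text},
        ],
    )
    label = response["choices"][0]["message"]["content"].strip().lower()
    return label if label in LABELS else "unknown"
```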
Snowflake as a data platform is the core data repository of many large organizations.
With the introduction of Snowflake's Snowpark for Python, Python developers can now collaborate and build on one platform with a secure Python sandbox, providing developers with dynamic scalability & elasticity as well as security and compliance.
In this talk I'll explain the core concepts of Snowpark for Python and how they can be used for large scale feature engineering and data science.
In this keynote, I will share the lessons learned from using Python in 4 industries. Apart from machine learning applications that I build in my day-to-day work as a data scientist and machine learning engineer, I also use Python to develop games for my own gaming company, Quill Game Studios. There is a lot of versatility in Python, and it's been my pleasure to use it to solve many interesting problems. I hope that this talk can give you inspiration for various types of applications in your own industry as well.
Time-series data is all around us: from logistics to digital marketing, from pricing to stock
markets. It’s hard to imagine a modern business that has no time series data to forecast.
However, mastering such forecasting is not an easy task.
For this talk, together with other domain experts, I have collected a list of common time
series issues that data professionals run into. After this talk, you will learn to
identify, understand, and resolve such issues. This will include stabilising divergent time
series, organising delayed / irregular data, handling missing values without anomaly propagation,
and reducing the impact of noise and outliers on your forecasting models.
We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable on individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library.
You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting
In this talk, we will explore the build-measure-learn paradigm and the role of baselines in natural language processing (NLP). We will cover the common NLP tasks of classification, clustering, search, and named entity recognition, and describe the baseline approaches that can be used for each task. We will also discuss how to move beyond these baselines through weak learning and transfer learning. By the end of this talk, attendees will have a better understanding of how to establish and improve upon baselines in NLP.
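To make the idea of a baseline concrete, here is a hypothetical sketch (not the speaker's code) of the kind of cheap classification baseline that everything else gets measured against:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data; in practice this would be your labelled corpus.
texts = ["great product", "terrible support", "works as expected", "never buying again"]
labels = [1, 0, 1, 0]

# TF-IDF features plus logistic regression: a strong, cheap baseline to beat.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(baseline, texts, labels, cv=2).mean())
```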
Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library in Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.
Pandas is the de-facto standard for data manipulation in Python, which I personally love for its flexible syntax and interoperability. But Pandas has well-known drawbacks such as memory inefficiency, inconsistent missing-data handling and lacking multicore support. Multiple open-source projects aim to solve those issues; the most interesting one is Polars.
Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks and evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations of long-time Pandas lovers regarding query language flexibility?
In this talk, I will explain how Polars can be that fast and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!).
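To make the comparison concrete, here is a small sketch of the lazy query style that gives Polars much of its speed; the file and column names are made up, and recent Polars releases use `group_by` where older ones used `groupby`:

```python
import polars as pl

# Lazy execution: the whole query is optimised as one plan and only materialised at .collect().
lazy = (
    pl.scan_csv("sales.csv")  # hypothetical file
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)
df = lazy.collect()
print(df)
```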
Proper monitoring of machine learning models in production is essential to avoid performance issues. Setting up monitoring can be easy for a single model, but it often becomes challenging at scale or when you face alert fatigue based on many metrics and dashboards.
In this talk, I will introduce the concept of test-based ML monitoring. I will explore how to prioritize metrics based on risks and model use cases, integrate checks into the prediction pipeline, and standardize them across similar models and the model lifecycle. I will also take an in-depth look at batch model monitoring architecture and the use of open-source tools for setup and analysis.
In recent years we have seen an explosion of Python usage in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly, which is essentially a virtual CPU inside the browser.
We evaluated time-series databases and complementary services to stream-process sensor data. In this talk, we will present our evaluation and show the final implementation, alongside the Python tools we've built and the lessons learned during the process.
When handling a large amount of data, memory profiling the data science workflow becomes more important. It gives you insight into which processes consume a lot of memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.
At the boundary of model development and MLOps lies the balance between the speed of deploying new models and ensuring operational constraints. These include factors like low latency prediction, the absence of vulnerabilities in dependencies and the need for the model behavior to stay reproducible for years. The longer the list of constraints, the longer it usually takes to take a model from its development environment into production. In this talk, we present how we seemingly managed to square the circle and have both a rapid, highly dynamic model development and yet also a stable and high-performance deployment.
In this talk, we will introduce the audience to DoWhy, a library for causal machine-learning (ML). We will introduce typical problems where causal ML can be applied and will specifically do a deep dive on root cause analysis using DoWhy. To do this, we will lay out what typical problem spaces for causal ML look like, what kind of problems we're trying to solve, and then show how to use DoWhy's API to solve these problems. Expect to see a lot of code with a hands-on example. We will close this session by zooming out a bit and also talk about the PyWhy organization governing DoWhy.
In this talk, we will report on our experiences switching from Pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular Pandas library, and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from Pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.
The name WALD-stack stems from the four technologies it is composed of: a cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source data integration engine Airbyte, the open-source full-stack BI platform Lightdash, and the open-source data transformation tool dbt.
Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find at waldstack.org.
The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. I present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications where model interpretability is crucial. I study the predictive performance of the proposed BHAD algorithm against various state-of-the-art anomaly detection approaches using simulated data as well as popular benchmark datasets for outlier detection. The reported results indicate that BHAD has very competitive predictive accuracy compared to other, more complex and computationally more expensive algorithms, while being explainable and fast.
Large Language Models (LLM), like ChatGPT, have shown miraculous performances on various tasks. But there are still unsolved issues with these models: they can be confidently wrong and their knowledge becomes outdated. GPT also does not have any of the information that you have stored in your own data. In this talk, you'll learn how to use Haystack, an open source framework, to chain LLMs with other models and components to overcome these issues. We will build a practical application using these techniques. And you will walk away with a deeper understanding of how to use LLMs to build NLP products that work.
In this talk, we will explore how to use the FastAPI web framework and Celery task queue to build reliable and scalable web applications in a test-driven manner. We will start by setting up a testing environment and writing unit tests for the core functionality of our application. Next, we will use FastAPI to create an API that performs some long-running task. Finally, we will see how Celery can help us offload long-running tasks and improve the performance of our application.
By the end of this talk, attendees will have a strong understanding of TDD and how to apply it to their FastAPI and Celery projects, and will be able to write tests that ensure the reliability and maintainability of their code.
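As a flavour of the approach (a hypothetical sketch, not the talk's code): the test is written first, and the endpoint only enqueues the long-running work, which Celery would execute outside the request cycle:

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

app = FastAPI()


@app.post("/process")
def process(payload: dict):
    # In the real application this would enqueue a Celery task, e.g. long_running_task.delay(payload).
    return {"status": "queued"}


client = TestClient(app)


def test_process_returns_queued():
    response = client.post("/process", json={"document_id": 42})
    assert response.status_code == 200
    assert response.json()["status"] == "queued"
```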
As machine learning becomes more prevalent across nearly every business and industry, making sure that these technologies are working and delivering quality is critical. In her talk, Alicia will discuss the importance of machine learning observability and why it should be a fundamental tool of modern machine learning architectures. Not only does it ensure models are accurate, but it helps teams iterate and improve models quicker. Alicia will dive into how Shopify has been prototyping building observability into different parts of its machine learning platform. This talk will provide insights on how to track model performance, how to catch any unexpected or erroneous behaviour, what types of behavior to look for in your data (e.g. drift, quality metrics) and in your model/predictions, and how observability could work with large language models and Chat AIs.
In this talk, we will explore the use of Python's typing.Protocol, Scala's Typeclasses, and Rust's Traits.
They all offer a powerful and elegant mechanism for abstracting over various concepts (such as Serialization) in a modular manner.
We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
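For the Python side, a minimal sketch of `typing.Protocol` in action (names are illustrative): any object with a matching `serialize` method is accepted, no inheritance required, which is the structural-typing counterpart of Scala's typeclasses and Rust's traits:

```python
import json
from typing import Protocol


class Serializer(Protocol):
    def serialize(self, obj: object) -> bytes: ...


class JsonSerializer:  # note: no subclassing of Serializer needed
    def serialize(self, obj: object) -> bytes:
        return json.dumps(obj).encode()


def store(obj: object, serializer: Serializer) -> bytes:
    # mypy checks structurally that the argument provides .serialize()
    return serializer.serialize(obj)


store({"a": 1}, JsonSerializer())
```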
The title “Data Scientist” has been in use for 15 years now. We have been attending PyData conferences for over 10 years as well. The hype around data science and AI seems higher than ever before. But how are we managing?
The aspect-oriented programming paradigm can support the separation of
cross-cutting concerns such as logging, caching, or checking of permissions.
This can improve code modularity and maintainability.
Python offers decorators to implement reusable code for cross-cutting tasks.
This tutorial is an in-depth introduction to decorators.
It covers the usage of decorators and how to implement simple and more advanced
decorators.
Use cases demonstrate how to work with decorators.
In addition to showing how functions can use closures to create decorators,
the tutorial introduces callable class instances as an alternative.
Class decorators can solve problems that used to be tasks for metaclasses.
The tutorial provides use cases for class decorators.
While the focus is on best practices and practical applications, the tutorial
also provides deeper insight into how Python works behind the scenes.
After the tutorial participants will feel comfortable with functions that take
functions and return new functions.
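As a small preview of the material (illustrative examples, not the full tutorial), here is a closure-based decorator next to a callable-class-instance decorator:

```python
import functools
import time


def timed(func):
    """Closure-based decorator: wraps func and reports its runtime."""
    @functools.wraps(func)  # preserve name, docstring and other metadata
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper


class CountCalls:
    """The same idea implemented with a callable class instance instead of a closure."""
    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.func(*args, **kwargs)


@timed
@CountCalls
def work(n):
    return sum(range(n))


work(1_000_000)
```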
In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value.
In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be hands-on, with a real-world case study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined to make optimal marketing budget decisions in complex scenarios.
In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
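As a tiny foretaste of the hands-on part (a hedged sketch with a hypothetical file name, not the tutorial's data), reading, reprojecting and spatially querying a vector layer with GeoPandas looks like this:

```python
import geopandas as gpd

# Any vector format supported by Fiona/GDAL works here; the file name is a placeholder.
districts = gpd.read_file("districts.geojson")

# Reproject to a metric CRS so areas and distances are meaningful.
districts = districts.to_crs("EPSG:3857")
districts["area_km2"] = districts.geometry.area / 1e6

# A simple spatial query: districts intersecting a 1 km buffer around the first geometry.
buffer = districts.geometry.iloc[0].buffer(1000)
nearby = districts[districts.intersects(buffer)]
print(nearby[["area_km2"]].head())
```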
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” These models raise the question: is unsupervised learning the best future for machine learning?
ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization), all at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.
Even if every data science project is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges.
This talk will cover key principles, patterns and open-source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems.
Within this talk, I want to look at the topic of data ethics with a practical lens and facilitate the discussion about how we can establish ethical data practices in our day-to-day work. I will shed some light on the multiple sources of bias in data applications: where are the potential pitfalls, and how can we prevent, detect and mitigate them early so they never become a risk for our data product? I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution, as tools and strategies need to be chosen to specifically address the requirements of the use case and models at hand, my talk will provide a good starting point for your own data ethics journey.
PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) zero-installation, serverless environment.
In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. built-in JavaScript integration and local modules), discussing new data and design patterns (e.g. loading heterogeneous data in the browser) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).
Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects.
However, BLE is not the most developer-friendly protocol, and these devices most of the time don't come with good documentation. In addition, there are not a lot of good open-source tools, examples, and tutorials on how to use Python with BLE, especially if one wants to build both sides of the communication.
In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python.
This talk is for everybody who has a basic understanding of Python and wants to gain a deeper understanding of how BLE works and how one could use it in a private project.
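The talk works with the Linux BlueZ stack directly; purely to illustrate the concepts (scanning, connecting, reading a characteristic), here is a hedged sketch using the cross-platform bleak library, with a placeholder device address:

```python
import asyncio

from bleak import BleakClient, BleakScanner

BATTERY_LEVEL_UUID = "00002a19-0000-1000-8000-00805f9b34fb"  # standard Battery Level characteristic


async def main():
    # Discover nearby advertising BLE devices.
    for device in await BleakScanner.discover(timeout=5.0):
        print(device.address, device.name)

    # Connect to one device (placeholder address) and read a characteristic by UUID.
    async with BleakClient("AA:BB:CC:DD:EE:FF") as client:
        value = await client.read_gatt_char(BATTERY_LEVEL_UUID)
        print("battery:", int(value[0]), "%")


asyncio.run(main())
```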
Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results.
With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely.
In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.
Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs.
In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics:
- An overview of Rust and its benefits for Python developers
- Profiling and identifying performance bottlenecks in a Python application
- Implementing a solution in Rust and integrating it with the Python application using PyO3
- Measuring the performance improvements and comparing them to other optimization techniques
Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
Innovations such as sentence-transformers, neural search and vector databases have fueled a very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a Slack bot that answers our questions by reading through our internal documents and Slack conversations. We therefore leveraged the Haystack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models.
This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.
An exchange of views on FastAPI in practice.
FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation.
FastAPI does a great job of getting people started with APIs quickly.
This talk will point out some obstacles and dark corners that we wish we had known about before, and highlight solutions to them.
At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts.
In this talk, we tell the story of how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture. How do you make sure that hundreds of people work in consistent environments? How do you get everyone on board to work together with Python? You have lots of domain experts without much software background; how do you prevent them from creating a mess when projects get larger?
Keeping in mind the Pythonic principle that “simple is better than complex”, we'll see how to create a web map with the Python-based web framework Django using its GeoDjango module, storing geographic data in your local database and running geospatial queries against it.
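A condensed sketch of the pattern (model and field names are hypothetical, and the snippet belongs inside a configured Django app, e.g. its models.py plus a view or shell session):

```python
from django.contrib.gis.db import models
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D


class Shop(models.Model):
    name = models.CharField(max_length=100)
    location = models.PointField(geography=True)  # stored in the spatial database backend


def shops_nearby():
    # All shops within 2 km of a given point (longitude, latitude, WGS84).
    here = Point(13.405, 52.52, srid=4326)
    return Shop.objects.filter(location__distance_lte=(here, D(km=2)))
```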
Debugging is hard. Distributed debugging is hell.
Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.
However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.
In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.
This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever
As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets, but instead many of us have to deal with longer input sequences. What started as a minor design choice for BERT, got cemented by the research community over the years and now turns out to be my biggest headache: the 512 tokens limit.
In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers:
- How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too.
- I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that.
- Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research.
- So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs, where core parts of DBMSs are being replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that this overhead occurs repeatedly, which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning, which is a general paradigm for learned DBMS components. Here, the idea is to train model
Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.
pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!
If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, making your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.
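As a first taste of those strengths (a small sketch, not from the talk), jit, grad and vmap compose freely on plain NumPy-style code:

```python
import jax
import jax.numpy as jnp


def predict(w, x):
    return jnp.tanh(x @ w)


def loss(w, x, y):
    return jnp.mean((predict(w, x) - y) ** 2)


# jit compiles to XLA (CPU/GPU/TPU), grad differentiates, and vmap vectorises a
# per-sample function over a batch without manual reshaping.
grad_loss = jax.jit(jax.grad(loss))
per_sample_pred = jax.vmap(predict, in_axes=(None, 0))

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(k1, (3,))
x = jax.random.normal(k2, (8, 3))
y = jnp.ones(8)
print(grad_loss(w, x, y), per_sample_pred(w, x).shape)
```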
Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done in the direction of offensive speech detection. However, there is another, more proactive way to fight toxic speech: suggesting a detoxified version of the message to the user. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, for both monolingual and multilingual use cases.
Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, Pandas, or the like. However, what do you do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets.
Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration efforts.
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.
Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis.
We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%.
In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.
Everybody knows our yellow vans, trucks and planes around the world. But do you know how data
drives our business and how we leverage algorithms and technology in our core operations? We will
share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven
Company.
• Large-Scale Use Cases: challenging and high-impact use cases in all major areas of logistics, including Computer Vision and NLP
• Fancy Algorithms: Deep Neural Networks, TSP solvers and the standard toolkit of a Data Scientist
• Modern Tooling: cloud platforms, Kubernetes, Kubeflow, AutoML
• No rusty working mode: small, self-organized, agile project teams, combining state-of-the-art Machine Learning with MLOps best practices
• A young, motivated and international team – German skills are only “nice to have”
But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life, and share our approach to a time series forecasting library, combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.
Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk!
Data visualization is a crucial ingredient for the success of any computer vision project.
It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders.
In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects.
Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.
Identification of causal relationships through running experiments is not always possible. In this talk, an alternative approach, quasi-experimental frameworks, is discussed. Additionally, I will present how to adjust well-known machine learning algorithms so they can be used to quantify causal relationships.
In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a plugin-friendly way, it encourages a modular organization in which the contracts between the core software and the plugins have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly.
Throughout the talk we will be creating our own plugin-friendly application using the pluggy library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager conda is currently making its 10-year-old code more plugin friendly, to illustrate how to retrofit an existing project.
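To give a flavour of what such a contract looks like with pluggy (hook and plugin names are made up for illustration):

```python
import pluggy

hookspec = pluggy.HookspecMarker("myapp")
hookimpl = pluggy.HookimplMarker("myapp")


class MyAppSpec:
    """The contract between the core application and its plugins."""

    @hookspec
    def process_record(self, record):
        """Transform a record; every registered plugin gets a chance to act on it."""


class UppercasePlugin:
    @hookimpl
    def process_record(self, record):
        return record.upper()


pm = pluggy.PluginManager("myapp")
pm.add_hookspecs(MyAppSpec)
pm.register(UppercasePlugin())

# The core calls the hook without knowing which plugins are installed.
print(pm.hook.process_record(record="hello"))  # -> ['HELLO']
```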
Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.
pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!
If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.
FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka.
The documentation of the library can be found here: https://fastkafka.airt.ai/
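A rough sketch of the decorator style the talk describes, based on my reading of the FastKafka docs; the exact broker configuration and naming conventions are assumptions, so check the documentation linked above before relying on them:

```python
from pydantic import BaseModel

from fastkafka import FastKafka


class Greeting(BaseModel):
    msg: str


# Assumed broker configuration format; see the FastKafka docs for the exact schema.
kafka_app = FastKafka(kafka_brokers={"localhost": {"url": "localhost", "port": 9092}})


@kafka_app.consumes()  # consumes from a topic derived from the function name ("on_...")
async def on_greeting(msg: Greeting):
    print(f"received: {msg.msg}")


@kafka_app.produces()  # produces the returned message to a topic derived from the name ("to_...")
async def to_greeting(msg: str) -> Greeting:
    return Greeting(msg=msg)
```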
I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.
During this panel, we’ll discuss the significant role PyLadies chapters around the world have played in advocating for gender representation and leadership and combating biases and the gender pay gap.
We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.
Doing data science in international development often means finding the right-sized solution in resource-constrained settings.
This talk walks you through how my team helped answer thousands of questions from pregnant folks and new parents on a South African maternal and child health helpline, which model we ended up choosing and why (hint: resource constraints!), and how we've packaged everything into a service that anyone can start for themselves.
By the end of the talk, I hope you'll know how to start your own FAQ-answering service and learn about one example of doing data science in international development.
In this talk we walk through our experience using Neo4j and Python to model climate policy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: today, every application, from web services that run in application servers, to interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, wants to leverage the power of data.
However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.
Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
This talk highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.
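To illustrate how thin the client side becomes (host and data are placeholders; Spark Connect ships with Spark 3.4+ and speaks the sc:// protocol, port 15002 by default):

```python
from pyspark.sql import SparkSession

# A remote session: the heavy lifting happens on the cluster, the client only holds the plan.
spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()
```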
A workshop for PyLadies members with the Berlin Tech Workers Council discussing the legal frameworks on contracts and termination agreements, as well as how employees can defend themselves in situations where they are made redundant due to mass layoffs.
A life without joy is like software without meaningful test data - it's uncertain and unreliable. The search for the perfect test data is a challenge. Real data should not be too real. Random data should not be too random. This is a randomly real and a really random journey to discover the balance between these two, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Python is a beautiful language for quickly prototyping and sketching ideas. However, people often struggle to get their code into production for various reasons. Besides all the security and safety concerns that usually are not addressed from the very beginning when playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking the Python code to the next level.
We will look at the "missing performance" worries using a simple numerical problem and see how to speed the corresponding Python code up to top-notch performance.
Image retrieval is the process of searching for images in a large database that are similar to one or more query images. A classical approach is to transform the database images and the query images into embeddings via a feature extractor (e.g., a CNN or a ViT), so that they can be compared via a distance metric. Self-supervised learning (SSL) can be used to train a feature extractor without the need for expensive and time-consuming labeled training data. We will use DINO's SSL method to build a feature extractor and Milvus, an open-source vector database built for evolutionary similarity search, to index image representation vectors for efficient retrieval. We will compare the SSL approach with supervised and pre-trained feature extractors.
The importance of enterprise architecture patterns is well known, and they are applicable to many types of tasks. Thinking about the architecture from the beginning of the journey is crucial to having a maintainable, and therefore testable and flexible, code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app built with the Repository, Unit of Work, and Services (Use Cases) patterns, tied together with Dependency Injection. All these patterns are quite famous in other languages, but they are relatively new to the Python ecosystem, where they have been a crucial missing piece.
As the web framework we are going to use FastAPI, which could be replaced with any other framework in no time thanks to the abstractions we have added.
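A condensed, illustrative sketch of the idea (names are made up): the route depends only on an abstract repository port, and FastAPI wires in a concrete adapter via dependency injection:

```python
from typing import Protocol

from fastapi import Depends, FastAPI


class OrderRepository(Protocol):  # the "port"
    def get(self, order_id: int) -> dict: ...


class InMemoryOrderRepository:  # one "adapter"; a SQL adapter would plug in the same way
    _orders = {1: {"id": 1, "status": "shipped"}}

    def get(self, order_id: int) -> dict:
        return self._orders.get(order_id, {})


def get_order_repository() -> OrderRepository:
    return InMemoryOrderRepository()


app = FastAPI()


@app.get("/orders/{order_id}")
def read_order(order_id: int, repo: OrderRepository = Depends(get_order_repository)):
    return repo.get(order_id)  # the route knows nothing about the storage backend
```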
Jupyter notebooks are a popular tool for data science and scientific computing, allowing users to mix code, text, and multimedia in a single document. However, sharing Jupyter notebooks can be challenging, as they require installing a specific software environment to be viewed and executed.
JupyterLite is a Jupyter distribution that runs entirely in the web browser without any server components. A significant benefit of this approach is the ease of deployment: with JupyterLite, the only requirement to provide a live computing environment is a collection of static assets. In this talk, we will show how you can create such a static website and deploy it to your users.
Working with Python is fun.
Managing Python packaging, linters, tests, CI, etc. is not as much fun.
Every maintainer needs to worry about consistent styling, quality, speed of tests, etc. as the project grows.
Monorepos have been successful in other communities - how do they work in Python?
Get ready to level up your big data processing skills! Join us for an introductory talk on Apache
Spark, the distributed computing system used by tech giants like Netflix and Amazon. We'll
cover PySpark DataFrames and how to use them. Whether you're a Python developer new to
big data or looking to explore new technologies, this talk is for you. You'll gain foundational
knowledge about Apache Spark and its capabilities, and learn how to leverage DataFrames and
SQL APIs to efficiently process large amounts of data. Don't miss out on this opportunity to up
your big data game!
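If you have never touched PySpark before, the two APIs the talk covers look roughly like this (local session, toy data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("games", 30.0)],
    ["category", "price"],
)

# The same aggregation via the DataFrame API ...
df.groupBy("category").agg(F.sum("price").alias("revenue")).show()

# ... and via the SQL API.
df.createOrReplaceTempView("sales")
spark.sql("SELECT category, SUM(price) AS revenue FROM sales GROUP BY category").show()
```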
Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution.
This talk will demonstrate the power of compound sparsity for model compression and inference speedup for NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured + unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. The session participants will learn the theory behind compound sparsity, state-of-the-art techniques, and how to apply it in practice using the Neural Magic platform.
How can NLP and Haystack help answer sustainability questions and fight climate change? In this talk we walkthrough our experience using Haystack to build Question Answering Models for the climate change and sustainability domain. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
We present an open-source library to shrink pickled scikit-learn and LightGBM models. We will provide insights into how pickling ML models works and how to improve the on-disk representation. With this approach, we can reduce the deployment size of machine learning applications by up to 6x.
By taking neural networks back to the school bench and teaching them some elements of geometry and topology we can build algorithms that can reason about the shape of data. Surprisingly these methods can be useful not only for computer vision – to model input data such as images or point clouds through global, robust properties – but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA) which we will introduce and review recent works at its intersection with machine learning. TDA can be seen as being part of the increasingly popular movement of Geometric Deep Learning which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.
Is your model prejudicial? Is your model deviating from the predictions it ought to have made? Has your model misunderstood the concept? In the world of artificial intelligence and machine learning, the word "fairness" is particularly common. It is described as having the quality of being impartial or fair. Fairness in ML is essential for contemporary businesses. It helps build consumer confidence and demonstrates to customers that their issues are important. Additionally, it aids in ensuring adherence to guidelines established by authorities, thus guaranteeing that the idea of responsible AI is upheld. In this talk, let's explore how certain sensitive features influence the model and introduce bias into it. We'll also look at how we can make it better.
Many good project ideas fail before they even start because of the sensitive personal data they require. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without producing an exact copy of the actual data.
This talk will give hands-on insights into synthetic data creation and the challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will find answers to why it takes so long to synthesize the huge amounts of data dormant in public administration. The talk addresses data owners who want to open up access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use and what its limitations are for real-world analyses.
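A minimal sketch with the Synthetic Data Vault (this follows the SDV 1.x single-table API; older releases used `sdv.tabular`, and the CSV file and its columns here are hypothetical):

```python
# Fit a synthesizer on a sensitive table and sample a synthetic replacement.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("patients.csv")          # hypothetical sensitive dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)        # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=len(real))
synthetic.to_csv("patients_synthetic.csv", index=False)
```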
The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely solely on large batch processes executed in a Spark cluster. But as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is far more powerful than you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
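As one illustrative (not prescriptive) example of this newer tooling, a lazy Polars query plans the whole pipeline before reading a single row and can execute it with a streaming engine on a Rust-backed core. The file and column names are made up, and the `group_by`/`streaming` spellings vary slightly between Polars versions.

```python
# A lazy, streaming Polars pipeline: only the needed columns and rows are read.
import polars as pl

result = (
    pl.scan_parquet("events.parquet")        # builds a query plan, reads nothing yet
    .filter(pl.col("country") == "DE")
    .group_by("user_id")                     # older Polars spells this groupby()
    .agg(pl.col("amount").sum().alias("total"))
    .collect(streaming=True)                 # execute with the streaming engine
)
```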
Data-driven products are becoming more and more ubiquitous. Humans build data-driven products. Humans are intrinsically biased. This bias goes into the data-driven products, confirming and amplifying the original bias. In this tutorial, you will learn how to identify your own, often unperceived, biases and reflect on and discuss the consequences of unchecked biases in data products.
Assessing the robustness of models is an essential step in developing machine-learning systems. To determine if a model is sound, it often helps to know which and how many input features its output hinges on. This talk introduces the fundamentals of “anchor” explanations that aim to provide that information.
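One common implementation of anchors is the `alibi` library; a minimal sketch on a toy tabular model (the talk may use different tooling or data) could look like this:

```python
# Anchor explanation for a single prediction: a rule that "anchors" the output.
from alibi.explainers import AnchorTabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

explainer = AnchorTabular(
    clf.predict, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
explainer.fit(X)

explanation = explainer.explain(X[0], threshold=0.95)
print(explanation.anchor)       # the feature conditions that fix this prediction
print(explanation.precision, explanation.coverage)
```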
Typing is at the center of "modern Python", and tools (mypy, beartype) and libraries (FastAPI, SQLModel, Pydantic, DocArray) based on it are slowly eating the Python world.
This talk explores the benefits of Python type hints and shows how they are infiltrating the next big domain: Machine Learning.
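A tiny taste of what that looks like in practice, assuming Python 3.9+ and Pydantic as one of the libraries mentioned above:

```python
# Type hints that mypy can check statically, and that Pydantic enforces at runtime.
from typing import Optional
from pydantic import BaseModel

def normalize(scores: list[float], default: Optional[float] = None) -> list[float]:
    """Scale scores by their maximum; fall back to `default` for an empty list."""
    if not scores:
        return [] if default is None else [default]
    top = max(scores)
    return [s / top for s in scores]

class Prediction(BaseModel):
    label: str
    score: float

Prediction(label="cat", score="0.93")   # the string is validated and coerced to a float
```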
"A modern AI start-up is a front-end developer plus a prompt engineer" is a popular joke on Twitter.
This talk is about LangChain, an open-source Python tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent that uses web search and generates Python code to answer questions about this conference.
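As a rough sketch of the prompt-and-answer part (LangChain's API has been changing quickly, so the exact imports may differ by version, and an OpenAI API key is assumed):

```python
# Build a prompt template, feed it to an LLM, and read back the completion.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write a one-sentence summary of {topic}.",
)
llm = OpenAI(temperature=0)               # requires OPENAI_API_KEY in the environment
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="prompt engineering"))
```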
Jupyter Notebooks have been a widely popular tool for data science in recent years due to their ability to combine code, text, and visualizations in a single document.
Despite its popularity, the core functionality and user experience of the Classic Jupyter Notebook interface have remained largely unchanged over the past years.
Recently, the Jupyter Notebook project decided to base its next major version, 7, on JupyterLab components and extensions, which means many JupyterLab features are now also available to Jupyter Notebook users.
In this presentation, we will demo the new features coming in Jupyter Notebook version 7 and how they are relevant to existing users of the Classic Notebook.
Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. A generator on its own iterates more slowly than a plain list loop, yet using generators can greatly increase the overall speed of an application. Let's discover why.
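A small illustration of the trade-off (the exact numbers depend on your machine):

```python
# A list materializes every element up front; a generator yields them lazily.
import sys

def squares_list(n):
    return [i * i for i in range(n)]

def squares_gen(n):
    for i in range(n):
        yield i * i

print(sys.getsizeof(squares_list(1_000_000)))   # several megabytes at once
print(sys.getsizeof(squares_gen(1_000_000)))    # a tiny generator object

# The application can start consuming results immediately and stop early,
# skipping work (and memory) for elements it never needs.
first = next(squares_gen(1_000_000))
```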
Does your production code look like it’s been copied from Untitled12.ipynb? Are your engineers complaining about the code but you can’t find the time to work on improving the code base? This talk will go through some of the basics of clean coding and how to best implement them in a data science team.
In the talk we give a brief overview of how we use Dynamic Pricing to tune the prices for rides based on demand, time of purchase, unexpected events such as strikes, and other criteria, in order to fulfil our business requirements.
Let’s say you are the ruler of a remote island. For it to succeed and thrive you can’t expect it to be isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We’re not going to talk about land masses or commerce, however, you should think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and is not always very consistent, similar to the networks you use to connect your application to the world.
We will explore some techniques and libraries in the Python ecosystem that make your life easier when dealing with external services. From asynchronicity, caching, and testing to building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully and avoid those pesky 2 AM errors that keep you awake.
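One defensive pattern in that spirit, sketched here with `requests` and `tenacity` (the talk may use different libraries, and the endpoint is hypothetical): timeouts, retries with exponential backoff, and a small cache for repeated lookups.

```python
# Never wait forever, retry transient failures, and cache repeated calls.
from functools import lru_cache

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def fetch_json(url: str) -> dict:
    response = requests.get(url, timeout=5)   # fail fast instead of hanging
    response.raise_for_status()
    return response.json()

@lru_cache(maxsize=256)
def get_exchange_rate(currency: str) -> dict:
    return fetch_json(f"https://api.example.com/rates/{currency}")  # hypothetical API
```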
This talk presents a novel approach to MLOps that combines the benefits of open-source technologies with the power and cost-effectiveness of cloud computing platforms. By using tools such as Terraform, MLflow, and Feast, we demonstrate how to build a scalable and maintainable ML system on the cloud that is accessible to ML Engineers and Data Scientists. Our approach leverages cloud managed services for the entire ML lifecycle, reducing the complexity and overhead of maintenance and eliminating the vendor lock-in and additional costs associated with managed MLOps SaaS services. This innovative approach to MLOps allows organizations to take full advantage of the potential of machine learning while minimizing cost and complexity.
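Of the tools named above, the experiment-tracking piece is the easiest to show in a few lines; a minimal MLflow sketch (the tracking URL and values are placeholders, and the Terraform and Feast parts are not shown) might be:

```python
# Log parameters, metrics, and (optionally) the model to a remote MLflow server.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # hypothetical managed endpoint
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.91)
    # mlflow.sklearn.log_model(model, artifact_path="model")  # register the trained model
```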
Streamlit, a pure-Python data app framework, has been ported to Wasm as "stlite".
See its power and convenience with many live examples and explore its internals from a technical perspective.
You will learn to quickly create interactive in-browser apps using only Python.
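The scripts stlite runs are ordinary Streamlit apps; nothing in a snippet like the following is stlite-specific.

```python
# A plain Streamlit app that stlite can serve entirely inside the browser.
import numpy as np
import pandas as pd
import streamlit as st

st.title("Hello from the browser")
n = st.slider("Number of points", 10, 1000, 100)
df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
st.line_chart(df)
```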
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing, and is becoming the de facto standard for tabular data. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted in the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
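A small taste of Arrow as an interchange layer (the file name and data are invented for illustration):

```python
# Build an Arrow table, persist it as Parquet, and move it to and from pandas.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": ["Berlin", "Paris"], "visits": [3, 5]})
pq.write_table(table, "visits.parquet")

df = pq.read_table("visits.parquet").to_pandas()   # hand the columns to pandas
back = pa.Table.from_pandas(df)                    # and back again
```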
tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.
Models in Natural Language Processing are fun to train but can be difficult to deploy. Their size, together with the libraries and files they require, can be challenging, especially in a microservice environment. When services should be as lightweight and slim as possible, large (language) models can cause a lot of problems. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kind of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production?
In this talk, you will learn about different ways to deploy NLP services. I will speak briefly about the path from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourself.
Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our solution for the machine learning lifecycle of the text-based cataloging classification system at idealo.de. We will share lessons learned and talk about our experiences during the lifecycle's migration from a hosted cluster to a cloud solution over the last 3 years. In addition, we will outline how we embedded our ML components into the overall idealo.de processing architecture.
In an era of big data, where large volumes of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since a lot of private information can be misused once leaked, not all data can be released for research. This talk discusses Differential Privacy, a cutting-edge cybersecurity technique that claims to preserve an individual's privacy: how it is employed to minimize the risks of working with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
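PyDP wraps Google's differential-privacy library; rather than guessing at its exact API here, the core idea can be sketched by hand: clip each contribution and add Laplace noise calibrated to how much one person can change the result.

```python
# Conceptual sketch of an epsilon-differentially-private mean (not the PyDP API).
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    values = np.clip(values, lower, upper)        # bound each individual's contribution
    sensitivity = (upper - lower) / len(values)   # max change from altering one record
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.random.randint(18, 90, size=1_000)
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```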
Bricks is an open-source content library for natural language processing, which provides the building blocks to quickly and easily enrich, transform or analyze text data for machine learning projects. For many Pythonistas, contributing to an open-source project seems scary and intimidating. In this tutorial, we offer a hands-on experience in which programmers and data scientists learn how to code their own building blocks and share their creations with the community with ease.
With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!
Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs.
Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. After execution, it notifies the main thread of the completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and effective use of the CPU.
Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in heavily I/O-bound applications like websites, depends on external resources. This could be anything from a remote database query to a POST request against an external API. As soon as you ask for any of these resources, your code is left waiting for the operation to complete with nothing to do. With asynchronous programming, you allow your code to handle other tasks while waiting for these resources to respond.
In this session, we are going to talk about asynchronous programming in Python: its benefits and multiple ways to implement it.
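A minimal asyncio sketch of the idea described above: while one coroutine waits on (simulated) I/O, the event loop runs the others.

```python
# Three overlapping "requests" finish in roughly the time of the slowest one.
import asyncio

async def fetch(name: str, delay: float) -> str:
    await asyncio.sleep(delay)          # stand-in for a slow network call
    return f"{name} done after {delay}s"

async def main() -> None:
    results = await asyncio.gather(
        fetch("db", 2.0), fetch("api", 1.5), fetch("cache", 1.0)
    )
    print(results)                      # takes ~2s instead of ~4.5s sequentially

asyncio.run(main())
```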
In this talk, I'll be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. The talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I'll mainly talk about Zarr's Python implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.
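A minimal sketch of the Python implementation (the shapes and chunk sizes are arbitrary):

```python
# A chunked, compressed on-disk array with NumPy-style indexing.
import numpy as np
import zarr

z = zarr.open(
    "example.zarr", mode="w", shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="f4"
)
z[0, :] = np.arange(10_000, dtype="f4")   # only the touched chunks are written

z_read = zarr.open("example.zarr", mode="r")
print(z_read.shape, z_read.chunks, z_read[0, :5])
```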
Tired of having to handle asynchronous processes for neuroevolution? Do you want to leverage massive vectorization and high-throughput accelerators for evolution strategies (ES)? evosax allows you to leverage JAX, XLA compilation and auto-vectorization/parallelization to scale ES to your favorite accelerators. In this talk we will get to know the core API and how to solve distributed black-box optimization problems with evolution strategies.
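The core ask/tell loop looks roughly like this (following the evosax README; exact signatures may shift between releases, and the quadratic objective is just a stand-in):

```python
# Ask for a population, evaluate it, tell the strategy the fitness values.
import jax
import jax.numpy as jnp
from evosax import CMA_ES

rng = jax.random.PRNGKey(0)
strategy = CMA_ES(popsize=20, num_dims=2)
es_params = strategy.default_params
state = strategy.initialize(rng, es_params)

for _ in range(50):
    rng, rng_ask = jax.random.split(rng)
    x, state = strategy.ask(rng_ask, state, es_params)   # candidates, shape (popsize, num_dims)
    fitness = jnp.sum(x ** 2, axis=-1)                   # toy objective on each candidate
    state = strategy.tell(x, fitness, state, es_params)
```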
Developers often use code coverage as a target, which makes it a bad measure of test quality.
Mutation testing changes the game: create mutant versions of your code that break your tests, and you'll quickly start to write better tests!
Come and learn to use it as part of your CI/CD process. I promise, you'll never look at penguins the same way again!
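A hand-made mutant shows the idea (tools such as mutmut generate and run these automatically):

```python
# The weak test gives 100% line coverage yet lets the mutant survive;
# the boundary-checking test kills it.
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18                    # mutation: ">=" became ">"

def test_weak():
    assert is_adult(30)                # passes for both versions, so the mutant survives

def test_strong():
    assert is_adult(18)                # fails on the mutant, so the mutant is killed
    assert not is_adult(17)
```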
After a decade of writing code, I joined the application security team. During the transition, I discovered that there are many myths about security and how difficult it is. Devs often choose to ignore it because they think writing more secure code would take them ages. That is not true. Security doesn't have to be scary. In my talk, you will learn the most useful pieces of application security theory. It will be practical and not boring at all.
Today, state-of-the-art technology and scientific research depend strongly on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. This situation creates problems both for individual contributors outside of this demographic (loss of career opportunities) and for the open source projects themselves (less robust technologies) [1][7]. In recent years there have been a number of recommendations and initiatives to increase the participation in open source projects of groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.
The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.
Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").
This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.
In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
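As a flavour of that integration, a dbt Python model is just a function returning a DataFrame; the sketch below assumes the Snowpark adapter (where `to_pandas()` is available) and uses invented model and column names.

```python
# models/revenue_by_team.py: a sketch of a dbt Python model (dbt 1.3+).
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns an adapter-specific DataFrame for the upstream model.
    orders = dbt.ref("stg_orders").to_pandas()

    # "the tax team defines revenue differently"
    orders["revenue"] = orders["amount"] - orders["tax"]

    return orders.groupby("team", as_index=False)["revenue"].sum()
```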
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the community at the time was felt to be “unbalanced and not seeing the true spectrum of the greater community”.
Why is that a big thing? Come to my talk and find out!