PyCon DE & PyData Berlin 2023

Observability for Distributed Computing with Dask
2023-04-18 11:40–12:10, B09

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong, life can become difficult because of all the moving parts. These include your code, other PyData libraries like NumPy and pandas, the machines you’re running on, the network between them, storage, the cloud, and, of course, Dask itself. It can be hard to understand what is going on, especially when things are slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.

Expected audience expertise: Domain: Intermediate
Expected audience expertise: Python: Intermediate
Abstract as a tweet:

Debugging is hard. Distributed debugging is hell. Let’s dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling to understand how Dask helps you remain sane while identifying and solving your problems.

Hendrik Makait

Hendrik Makait is a data and software engineer building systems at the intersection of large-scale data management and machine learning. Currently, he works as an Open Source Engineer at Coiled improving Dask and its distributed execution engine.
