Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars :: PyConDE & PyData Berlin 2024

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
.ical

2024-04-23 14:10–14:40, B05-B06

Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses all of the pain points that users ran into. We will look into how Dask is a lot faster now, how it performs on benchmarks that is struggled with in the past and how it compares to other tools like Spark, DuckDB and Polars.

Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space.

Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Other tools (Spark, DuckDB, Polars) just did this better.

Fortunately, these pain points have been fixed with the following features:

A new and vastly improved shuffle algorithm
A logical query planning layer to improve performance and usability
A reduced memory footprint through a more efficient data model due to pandas 2.0

We will look into how these changes work together across pandas, Arrow, and Dask to provide a better UX and a more robust and faster system overall. Additionally, we will look into a comparison of Dask against other tools in the big data space, including Spark, Polars and DuckDB.

We will use the TPC-H benchmarks to compare these tools. We will look ahead into what the future will bring for pandas and Dask and how the logical query planning layer can be extended to fit other frameworks like Dask Array and XArray.

Expected audience expertise: Domain:

Novice

Expected audience expertise: Python:

Novice

Abstract as a tweet (X) or toot (Mastodon):

Dask DataFrame is fast now - The re-implementation of DataFrames in Dask is fast, reliable and fun.

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars .ical 2024-04-23 14:10–14:40, B05-B06

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
.ical

2024-04-23 14:10–14:40, B05-B06