PyConDE & PyData Berlin 2024

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
04-23, 14:10–14:40 (Europe/Berlin), B05-B06

Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses all of the pain points that users ran into. We will look into how Dask is a lot faster now, how it performs on benchmarks that is struggled with in the past and how it compares to other tools like Spark, DuckDB and Polars.


Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space.

Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Other tools (Spark, DuckDB, Polars) just did this better.

Fortunately, these pain points have been fixed with the following features:

  • A new and vastly improved shuffle algorithm
  • A logical query planning layer to improve performance and usability
  • A reduced memory footprint through a more efficient data model due to pandas 2.0

We will look into how these changes work together across pandas, Arrow, and Dask to provide a better UX and a more robust and faster system overall. Additionally, we will look into a comparison of Dask against other tools in the big data space, including Spark, Polars and DuckDB.

We will use the TPC-H benchmarks to compare these tools. We will look ahead into what the future will bring for pandas and Dask and how the logical query planning layer can be extended to fit other frameworks like Dask Array and XArray.


Abstract as a tweet (X) or toot (Mastodon)

Dask DataFrame is fast now - The re-implementation of DataFrames in Dask is fast, reliable and fun.

Expected audience expertise: Domain

Novice

Expected audience expertise: Python

Novice

See also: Slides (2.4 MB)

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He is currently working at Coiled where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds a Msc degree in Mathematics and works towards a Msc in Software engineering at the University of Oxford.