Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
Florian Jetter, Patrick Hoefler
Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses all of the pain points that users ran into. We will look at how Dask has become much faster, how it performs on benchmarks that it struggled with in the past, and how it compares to other tools like Spark, DuckDB, and Polars.
PyData: Data Handling & Engineering
B05-B06