Building Large Scale ETL Pipelines with Dask
2024-09-25, Gaston Berger

Building scalable ETL pipelines and deploying them in the cloud can seem daunting. It shouldn't be. Leveraging the right technologies can make this process straightforward. We will walk through the whole process of developing a composable and scalable ETL pipeline centred around Dask, built entirely with open-source tools, and show how to deploy it to the cloud.


Running large-scale Python jobs on a regular cadence in the cloud as part of production data pipelines is a common requirement. Building ETL pipelines that operate at scale and deploying them in the cloud can seem daunting: managing different tools, building infrastructure that scales to large datasets, and making sure that everything runs reliably all add complexity.

Modern workflow orchestration systems like Prefect, Dagster, and Airflow all work well for running jobs on a regular cadence. Dask fulfils the scalability requirement, and storage formats like Delta Lake and Iceberg are designed to handle data at this scale.
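To make the orchestration side concrete, here is a minimal sketch of how a Dask job could be wrapped in a Prefect flow and run on a schedule. The S3 paths, filter column, and cron expression are hypothetical placeholders, and the scheduling call assumes Prefect's `flow.serve` API.

```python
import dask.dataframe as dd
from prefect import flow


@flow
def daily_etl(
    input_path: str = "s3://my-bucket/raw/*.parquet",   # hypothetical path
    output_path: str = "s3://my-bucket/clean/",          # hypothetical path
):
    # Read the raw data lazily, apply a simple transformation, write it back out.
    df = dd.read_parquet(input_path)
    df = df[df["status"] == "complete"]
    df.to_parquet(output_path)


if __name__ == "__main__":
    # Assumes Prefect 2.x-style serving: trigger the flow every day at 03:00.
    daily_etl.serve(name="daily-etl", cron="0 3 * * *")
```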

Dask is a library for distributed computing in Python that integrates tightly with pandas and other libraries from the PyData stack. It provides a DataFrame API that wraps pandas and thus offers an easy transition into the big data space. We will dive into recent Dask developments, such as a new shuffle algorithm, a query optimizer, and other smaller changes, and investigate how they improve scalability and performance to make Dask a great fit for large-scale ETL pipelines. Gone are the days when Dask performed poorly on benchmarks: the new Dask DataFrame implementation can now beat Spark on the popular TPC-H benchmarks in both cost and performance.
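As an illustration of how closely the Dask DataFrame API mirrors pandas, the following sketch reads a partitioned Parquet dataset and runs a groupby aggregation; the path and column names are made up for the example.

```python
import dask.dataframe as dd

# Lazily read a partitioned Parquet dataset; nothing is loaded into memory yet.
df = dd.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path

# The same filter/groupby/aggregation you would write in pandas.
result = (
    df[df["amount"] > 0]
    .groupby("customer_id")["amount"]
    .sum()
)

# Trigger the distributed computation and collect the result as a pandas object.
print(result.compute())
```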

We will illustrate how all these tools fit together nicely and how we can glue them into a composable ETL pipeline, including the infrastructure and architecture behind it. We will look at a live example of how our pipeline operates, and discuss how we can test locally and seamlessly switch deployment to the cloud, as sketched below.
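Here is one way the local-to-cloud switch can look in practice: only the cluster object changes, while the pipeline code stays the same. The `coiled.Cluster` arguments (`n_workers`, `region`) and the S3 paths are illustrative assumptions, not the exact setup from the talk.

```python
import dask.dataframe as dd
from dask.distributed import LocalCluster, Client

LOCAL = True  # flip to False to run the same pipeline in the cloud

if LOCAL:
    # Local cluster for development and testing on a laptop.
    cluster = LocalCluster()
else:
    # Cloud deployment via Coiled; arguments shown are illustrative.
    import coiled
    cluster = coiled.Cluster(n_workers=20, region="us-east-2")

client = Client(cluster)

# The pipeline code itself is identical in both environments.
df = dd.read_parquet("s3://my-bucket/raw/*.parquet")  # hypothetical path
df.to_parquet("s3://my-bucket/clean/")
```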

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He is currently working at Coiled, where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds an MSc in Mathematics and is working towards an MSc in Software Engineering at the University of Oxford.