PyCon GR 2025

ETL Beyond Spark
2025-08-30, Innovathens - Main Stage

Would you use a threshing machine to mow your lawn? Probably not — yet many data engineers reach for Spark to run ETL pipelines, even when it's overkill. In reality, most of us deal with data volumes that are large, but not that large — think gigabytes, not petabytes. Tools like DuckDB, Polars, and Dask are challenging the default assumption that “big data problems need big frameworks.”

This talk is for Python developers, data engineers, and analytics engineers who work with batch data pipelines and want to simplify their stack without sacrificing performance. We’ll explore a comparative benchmark of modern Python-native ETL tools, using a data model representative of real-life scenarios — including both historical and incremental loads — and test their performance on datasets ranging from “fits in RAM” to multiple gigabytes.

You’ll see real-world code examples of common ETL transformations, understand the upfront setup and ongoing maintenance each tool requires, and learn how to choose the right tool based on your workload and team skills. Whether you’re building out a data platform or just cleaning up CSVs, this talk will help you rethink ETL for the modern Python ecosystem.