Helium [3rd Floor]
As data engineers, we are used to spinning up a Spark cluster every time we want to process data, and to handling the overhead that comes with such a mighty framework. But is this really necessary? In this talk I will argue that single-node processing with Polars is in many cases easier and cheaper. I will compare a typical ETL and feature engineering task in Spark and in Polars and offer a pragmatic opinion on when to use one or the other.
Apache Spark is the industry standard for big data processing, and rightfully so. But for many data processing applications, a more lightweight solution will work just as well while avoiding Spark's compute and configuration overhead. Polars offers such a solution, with a fast single-node processing engine and a syntax that will pose no problems for experienced Spark developers.
I will give a short comparison of Spark and Polars, covering their similarities and differences, and show an implementation of a typical ETL and feature engineering task in both. I will compare the deployment, performance and cost of the two and, while giving my own opinion on the topic, hope to enable you to make an informed decision on when to use Polars and when to use Spark.
Data Engineer at inovex since 2022, full-time software engineer since 2018, coder for as long as I can remember. With my experience working on data warehouses and machine learning applications from small-scale tests up to international deployments, I enjoy eliminating bugs and bottlenecks, getting cool systems online and writing beautiful code. Still proud of the time when a colleague complained that, because of me, deploying to production had become too easy and was no longer a thrilling adventure.