Europium [3rd Floor]
Python UDFs often become the slowest part of PySpark pipelines because they run row-by-row and pay a high cost crossing the JVM↔Python boundary. Spark’s Arrow-backed execution changes that cost model by moving data in columnar batches, which can reduce overhead and enable efficient, vectorized processing in Python.
In this session, we’ll cover practical patterns for writing Arrow-friendly UDF logic and integrating it with fast Python execution engines that operate on Arrow data. We’ll compare common approaches—scalar UDFs, Pandas UDFs, Arrow-native UDFs, and table-shaped Arrow transforms—then translate the results into a decision guide you can apply to production pipelines. Attendees will leave knowing when Arrow helps, when it doesn’t, and how to design UDF-heavy transformations that scale.
Objective
Demonstrate how to accelerate UDF-heavy PySpark workloads by switching from row-wise execution to Arrow-backed columnar execution, using Polars for fast, maintainable column and table transformations.
Key Takeaways
- How Arrow is used in PySpark for batched, columnar data exchange
- Why Polars helps: a higher-level DataFrame API plus Arrow interoperability that can often reuse Arrow buffers without copying
- How to design fast column transformations (column in → column out) and fast table transformations (batch/table in → batch/table out)
- Benchmarks and tradeoffs across scalar UDFs, Pandas UDFs, Arrow-native UDFs, and Polars-based Arrow table transforms on real-world examples
Audience
- Data engineers and data scientists working with PySpark at scale
- Engineers seeking concrete strategies to optimize Spark pipelines that rely on Python UDFs
Knowledge Expected
- Familiarity with PySpark DataFrames and UDFs
- Basic understanding of Spark execution helps but is not required
- Exposure to Polars/Arrow is not required but might be beneficial
Aimilios is a software engineer at Frontiers Media SA. Passionate about solving technical challenges and sharing his knowledge across computer engineering, from ETL pipelines and optimization to in-house tooling and architectural decisions, he makes a valuable contribution to his team's objectives. Before joining Frontiers, he worked as a DevOps engineer at CERN, where he contributed to projects in cloud computing, disaster recovery, automation, observability, and databases. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.