PyCon Hong Kong 2024

How do I debug my PySpark workloads?
2024-11-16, LT7
Language: English

PySpark is widely adopted for data analysis in distributed computing environments. It supports not only the standard DataFrame API but also Python User Defined Functions (UDFs), Python Data Sources, Python UDTFs, and more. However, debugging and profiling applications in such distributed environments is often challenging: you can't simply add a breakpoint and inspect variables in your IDE.

In this presentation, I will demonstrate effective methods for debugging and profiling PySpark applications using existing tools. These include profiling tools that utilize cProfile, a standard Python profiler, along with various tricks and best practices for monitoring and debugging PySpark applications.
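As a rough local illustration of the cProfile side of this, the sketch below profiles a plain Python function of the kind one might wrap in a UDF. It runs without a Spark cluster and does not show PySpark's own profiling tools; `normalize_name`, `profile_locally`, and the sample rows are hypothetical names made up for the example.

```python
import cProfile
import io
import pstats


def normalize_name(raw: str) -> str:
    """Hypothetical function standing in for the body of a Python UDF."""
    return " ".join(part.capitalize() for part in raw.strip().split())


def profile_locally(func, rows):
    """Run func over rows under cProfile; return (results, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        results = [func(row) for row in rows]
    finally:
        profiler.disable()

    # Render the collected stats to a string, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return results, buffer.getvalue()


# Hypothetical sample input, assumed only for this illustration.
sample_rows = ["  ada lovelace ", "grace  hopper", "alan turing"]
results, report = profile_locally(normalize_name, sample_rows)
print(results)
print(report)
```

Profiling the function on a small local sample like this can help narrow down slow Python code before running it on a cluster, where the same function executes in worker processes that an IDE debugger cannot easily reach.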

Hyukjin is a software engineer at Databricks and the tech lead of the OSS PySpark team. He is an ASF member and an Apache Spark PMC member and committer, working on many areas of Apache Spark such as PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, Pandas API on Spark, and Python Spark Connect.

Allison is a software engineer at Databricks, working on Spark SQL and PySpark. She holds a Bachelor’s degree in Computer Science from Carnegie Mellon University.