Python Conference APAC 2024

Profile, debug and monitor my PySpark workloads
2024-10-26, CLASS #4
Language: English

PySpark is widely used for data analysis in distributed computing environments, offering not only the standard DataFrame API but also Python User Defined Functions (UDFs), Python Data Sources, Python UDTFs, and more. However, debugging and profiling applications in such environments can be challenging. For example, you cannot simply set a breakpoint and step through the code in your IDE, because the Python code runs in worker processes on remote executors.

In this presentation, I will explore techniques for debugging and profiling PySpark applications using existing tools such as cProfile, the profiler that ships with the Python standard library. I'll also share tips and best practices for effectively monitoring and debugging PySpark applications.
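As a rough illustration of the kind of profiling mentioned above, the sketch below runs cProfile over a plain Python function of the sort that might be wrapped as a UDF. It is not taken from the talk: `normalize_name`, `profile_call`, and the sample rows are hypothetical names chosen for this example, and it runs locally without a Spark cluster, so it shows only the standard-library side of the workflow.

```python
import cProfile
import io
import pstats


def normalize_name(name: str) -> str:
    """Hypothetical function that could be wrapped as a Python UDF."""
    return " ".join(part.capitalize() for part in name.strip().split())


def profile_call(func, rows):
    """Run func over rows under cProfile; return (results, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        results = [func(row) for row in rows]
    finally:
        # Always stop profiling, even if func raises.
        profiler.disable()

    # Render the ten most expensive entries, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return results, buffer.getvalue()


# Sample input standing in for the values of a DataFrame column (assumed data).
sample_rows = ["  ada lovelace ", "grace  hopper", "alan turing"]
results, report = profile_call(normalize_name, sample_rows)
print(report)
```

Profiling the function locally like this is a cheap first check before running it inside a distributed job, where the same code executes in worker processes that are harder to observe.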

Hyukjin is a software engineer at Databricks and the tech lead of the OSS PySpark team. He is an ASF member and an Apache Spark PMC member and committer, working on many areas of Apache Spark, including PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, Pandas API on Spark, and Python Spark Connect.