PyCon Hong Kong 2024

How do I debug my PySpark workloads?
11-16, 14:00–14:30 (Asia/Hong_Kong), LT7
Language: English

PySpark is widely adopted for data analysis in distributed computing environments. It supports not only the standard DataFrame API but also Python User Defined Functions (UDFs), Python Data Sources, Python UDTFs, and more. However, debugging and profiling applications in such distributed environments is often challenging: you can't simply add a breakpoint and inspect variables in your IDE.

In this presentation, I will demonstrate effective methods for debugging and profiling PySpark applications using existing tools. These include profilers built on cProfile, the standard Python profiler, along with various tricks and best practices for monitoring and debugging PySpark applications.
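To illustrate the cProfile-based approach the abstract refers to, here is a minimal, standard-library-only sketch of profiling a plain Python function standing in for UDF logic. The name `expensive_udf_body` is illustrative, not a PySpark API; PySpark's own UDF profilers collect and aggregate reports like this one from the executors.

```python
import cProfile
import io
import pstats


def expensive_udf_body(value: int) -> int:
    # Stand-in for the per-row logic you would put inside a Python UDF.
    return sum(i * i for i in range(value))


# Profile the function the same way cProfile would on an executor.
profiler = cProfile.Profile()
profiler.enable()
results = [expensive_udf_body(v) for v in range(100)]
profiler.disable()

# Render the stats sorted by cumulative time, similar in spirit to the
# per-UDF reports that PySpark's profiling tools surface to the driver.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Running this prints a table of call counts and cumulative times per function, which is the raw material for spotting hot spots in UDF code.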

Hyukjin is a software engineer at Databricks and the tech lead of the OSS PySpark team. He is an ASF member and an Apache Spark PMC member and committer, working on many areas of Apache Spark such as PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, Pandas API on Spark, and Python Spark Connect.

Allison is a software engineer at Databricks, working on Spark SQL and PySpark. She holds a Bachelor’s degree in Computer Science from Carnegie Mellon University.