Hyukjin Kwon
Hyukjin is a Databricks software engineer as the tech-lead in OSS PySpark team, ASF member, Apache Spark PMC member and committer, working on many different areas in Apache Spark such as PySpark, Spark SQL, SparkR, infrastructure, etc. He is the top contributor in Apache Spark, and leads efforts such as Project Zen, Pandas API on Spark, and Python Spark Connect.
Hong Kong
Company / Organisation –Apache Software Foundation
Session
PySpark is widely adopted for data analysis in distributed computing environments. It supports not only the standard DataFrame API but also Python User Defined Functions (UDFs), Python Data Sources, Python UDTFs, and more. However, debugging and profiling applications in such distributed environments are often challenging - you can't simply add a breakpoint and inspect variables in your IDE.
In this presentation, I will demonstrate effective methods for debugging and profiling PySpark applications using existing tools. These include profiling tools that utilize cProfile, a standard Python profiler, along with various tricks and best practices for monitoring and debugging PySpark applications.