Scaling pandas to any size with PySpark
2023-08-17, Aula

This talk discusses using the pandas API on Apache Spark to handle big data, and the introduction of Pandas Function APIs. Presented by an Apache Spark committer and a product manager, it offers technical and managerial insights.


Undoubtedly, pandas plays a crucial role in data wrangling and analysis tasks. However, it is limited to data that fits and can be processed on a single machine. This creates a dilemma for data practitioners: should they sacrifice information by downsampling their data, or should they adopt a distributed processing framework to handle larger workloads? One popular option is Apache Spark, a mainstream distributed processing engine. Yet using Spark means learning a new API, PySpark, which can be a challenge for pandas users.

Thankfully, there is a silver lining: the pandas API on Spark provides pandas-equivalent functionality on top of PySpark. This allows pandas users to seamlessly transition from a single node to a distributed environment by merely replacing the pandas package with pyspark.pandas.
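As a minimal sketch of this drop-in swap (the column names are illustrative, and the commented-out import assumes PySpark 3.2+ is installed), the same groupby code runs unchanged on either backend:

```python
import pandas as pd
# On a Spark cluster, only the import changes:
# import pyspark.pandas as pd  # requires PySpark 3.2+

# Identical pandas syntax either way; with pyspark.pandas the
# computation is distributed across the cluster instead of
# running on a single machine.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
means = df.groupby("group")["value"].mean().sort_index()
print(means)
```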

Conversely, existing PySpark users sometimes need functionality that the PySpark API does not provide, and must write custom user-defined functions (UDFs). With the Pandas Function APIs introduced in Spark 3.0+, users can apply arbitrary native Python functions, annotated with type hints and taking pandas instances as input and output, to a PySpark DataFrame. This empowers data scientists, for example, to train an ML model on each data group with just a single line of code.
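A sketch of the pattern (the column names and schema here are illustrative): a per-group function written with pandas type hints can be applied to a PySpark DataFrame via `applyInPandas`, and because the function itself is plain Python over pandas, it can be tested locally before running it on a cluster:

```python
import pandas as pd

# Per-group function: receives all rows of one group as a pandas
# DataFrame and returns a pandas DataFrame (here, mean-centering).
def center(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# On a PySpark DataFrame `df` with columns "group" and "value":
# df.groupBy("group").applyInPandas(center, schema="group string, value double")

# The same function runs on plain pandas, so it is easy to test locally:
pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})
out = pdf.groupby("group", group_keys=False).apply(center)
```

Spark ships each group to a worker as a pandas DataFrame, applies the function in parallel, and reassembles the results into a distributed DataFrame matching the declared schema.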

And you don't even need to write PySpark code anymore! English is the new programming language, and we will introduce the English SDK for PySpark. The English SDK understands Spark tables and DataFrames, handles the complexity for you behind the scenes, and returns a DataFrame directly from your English questions and instructions.

Presented jointly by a top open-source Apache Spark committer and a product manager, this talk offers both the software engineer's and the product manager's perspectives. Prior working knowledge of pandas, basic Spark, and machine learning will be helpful for the audience.


Category [Community, Education, and Outreach]

Science and Engineering Portals

Category [Machine and Deep Learning]

Other

Expected audience expertise: Domain

some

Category [Scientific Applications]

Other

Category [Data Science and Visualization]

Data Analysis and Data Engineering

Expected audience expertise: Python

some

Abstract as a tweet

Scaling data workloads using the best of both worlds: pandas and Spark

Category [High Performance Computing]

Parallel Computing

Hyukjin is a Databricks software engineer and the tech lead of the OSS PySpark team, as well as an Apache Spark PMC member and committer. He works on many different areas of Apache Spark, including PySpark, Spark SQL, SparkR, and infrastructure. He is the top contributor to Apache Spark and leads efforts such as Project Zen, the pandas API on Spark, and Python Spark Connect.

Allan is a product manager at Databricks mainly working on PySpark.
He is passionate about helping people make sense of data and has focused on that his whole career.