Unlocking the Power of PySpark: A Comprehensive Workshop
05-19, 14:00–15:30 (Europe/Vilnius), Coral A - Workshop

Are you struggling with big data in your business? Join us to discover how PySpark can help you solve your problems efficiently and effectively. In this workshop, we will revisit the key concepts of PySpark, including parallel processing and lazy evaluation. We will explore DataFrames as a convenient layer on top of so-called RDDs and work with the query optimizer to get the most out of our transformations.

We'll also take a look at the Spark UI, which allows us to monitor and optimize our processes. To put our knowledge into practice, we'll simulate a business problem and walk through the entire process of data preparation (preprocessing), training a model with MLlib, and performing inference on preprocessed test data. We'll also add a business logic layer to our solution for further customization (postprocessing).

Optional content includes lessons learned from large-scale production systems based on PySpark. We'll share insights on how to optimize performance and scale your solution to handle big data with ease.


Are you looking for a powerful tool to tackle your big data problems? PySpark may be just what you need. Join us for a comprehensive workshop on PySpark, where we'll cover everything from the basics of parallel processing and lazy evaluation to deploying production-level solutions on a large scale.

In this workshop, we'll start by revisiting the key concepts of PySpark and exploring how it can help us solve business problems with potentially very large amounts of data. We'll then dive into working with DataFrames as a convenient layer on top of RDDs and let the query optimizer get the most out of our transformations.

To help you put your new knowledge into practice, we'll simulate a real-world business problem and walk you through the entire process of data preparation, model training with MLlib, and running inference on preprocessed test data. We'll also add a business logic layer to our solution for further customization.

Throughout the workshop, we'll utilize the Spark UI to monitor and optimize our processes. We'll provide you with code that can be easily adapted to run on various platforms, including a cluster in AWS Glue and your localhost.

In addition, we'll cover optional content on lessons learned from large-scale production systems based on PySpark. We'll share insights on how to optimize performance and scale your solution to handle big data with ease.

Whether you're just starting out with PySpark or looking to take your skills to the next level, this workshop is designed for you. Join us to discover how to harness the power of PySpark for big data and take your business to the next level.


What is the level of your talk?

Intermediate

What topics define your talk the best?

python, open source, PyData, optimization and speed, ML engineering, data engineering

Carsten works as a data science consultant for Datadrivers, a consulting company based in Hamburg.
After working in risk management and graduating in mathematics, he entered the field five years ago. He focuses on the development of end-to-end AI solutions for customers in various industries, preferably in the cloud.