Unlocking the Power of PySpark: A Comprehensive Workshop

Room: Coral A - Workshop

Date: 2023-05-19

Time: 14:00 - 15:30

Abstract

Are you struggling with big data in your business? Join us to discover how PySpark can help you solve your problems efficiently and effectively. In this workshop, we will revisit the key concepts of PySpark, including parallel processing and lazy evaluation. We will explore DataFrames as a convenient layer on top of so-called RDDs and work with Spark's query optimizer to get the most out of our transformations. We'll also take a look at the Spark UI, which allows us to monitor and optimize our processes. To put our knowledge into practice, we'll simulate a business problem and walk through the entire process of data preparation (preprocessing), training a model with MLlib, and performing inference on preprocessed test data. We'll also add a business logic layer to our solution for further customization (postprocessing). Optional content includes lessons learned from large-scale production systems based on PySpark. We'll share insights on how to optimize performance and scale your solution to handle big data with ease.

Carsten Frommhold

Carsten works as a data science consultant for Datadrivers, a consulting company based in Hamburg. After working in risk management and graduating in mathematics, he entered the field five years ago. He focuses on the development of end-to-end AI solutions for customers in various industries, preferably in the cloud.