Supercharging ML Data Processing with PySpark Optimizations

Unmesh Padalkar (~unmesh)



Description:

Large-scale data processing in machine learning (ML) often involves massive datasets that require distributed computing frameworks to handle efficiently. PySpark, the Python API for Apache Spark, is widely used for this purpose, and PySpark optimizations can significantly enhance the performance and scalability of ML workflows. At Dream11, data processing pipelines need to scale efficiently to build models for millions of users (200 million registered users). This talk will delve into some of the key PySpark optimizations used at Dream11, covering the topics below (each illustrated with a short PySpark sketch after the list):

  1. Efficient Data Ingestion
     - Partitioning: Proper partitioning of data during ingestion improves query performance by reducing the amount of data shuffled across the network.
  2. Data Partitioning
     - Repartitioning: Repartition data into a suitable number of partitions to balance parallelism against the overhead of managing too many partitions.
     - Coalescing: Reduce the number of partitions for downstream operations to minimize shuffling and improve efficiency.
  3. Optimizing Transformations
     - Avoid Wide Transformations: Narrow transformations (e.g., map, filter) are more efficient because they don't require shuffling data across the cluster.
     - Pipeline Aggregations: Combine multiple transformations into a single stage to minimize shuffling and reduce the complexity of the execution plan.
  4. Efficient Joins
     - Broadcast Joins: Use broadcast joins when one of the datasets is small enough to fit in memory, avoiding a shuffle of the large dataset.
     - Skewed Data Handling: Address data skew by salting keys or using custom partitioning to ensure an even distribution of data.
     - Bucketed Joins: Avoid shuffling altogether by turning regular joins into bucketed joins.
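
A minimal sketch of what partition-aware ingestion (item 1) might look like; the storage paths and the `event_date` partition column are hypothetical placeholders, not the actual Dream11 pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-ingestion").getOrCreate()

# Hypothetical raw source; write it out partitioned by a date column so that
# downstream reads can prune partitions instead of scanning everything.
events = spark.read.json("s3://bucket/raw/events/")
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("s3://bucket/curated/events/"))

# A filter on the partition column now touches only the matching directories
# (partition pruning), cutting I/O and the data that later stages shuffle.
recent = (spark.read.parquet("s3://bucket/curated/events/")
          .filter("event_date >= '2024-01-01'"))
```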
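
A minimal sketch contrasting repartition and coalesce (item 2); the partition counts and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-coalesce").getOrCreate()
df = spark.read.parquet("s3://bucket/curated/events/")  # hypothetical input

# repartition() performs a full shuffle: use it to get evenly sized partitions
# and enough parallelism, keyed on the column a heavy stage will group on.
df = df.repartition(400, "user_id")
aggregated = df.groupBy("user_id").count()

# coalesce() only merges existing partitions (no full shuffle), so it is the
# cheaper way to shrink the partition count before writing output files.
(aggregated.coalesce(50)
    .write.mode("overwrite")
    .parquet("s3://bucket/output/user_counts/"))
```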
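
A minimal sketch of keeping transformations narrow and pipelined (item 3); the event schema and feature logic are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-transformations").getOrCreate()
df = spark.read.parquet("s3://bucket/curated/events/")  # hypothetical input

# filter/withColumn/select are narrow transformations: Spark pipelines them
# into a single stage with no shuffle between them.
features = (df
            .filter(F.col("event_type") == "contest_join")
            .withColumn("is_weekend", F.dayofweek("event_date").isin(1, 7))
            .select("user_id", "event_date", "is_weekend"))

# A single aggregation at the end triggers one shuffle, instead of several
# intermediate wide operations scattered through the pipeline.
daily = features.groupBy("user_id", "event_date").agg(
    F.sum(F.col("is_weekend").cast("int")).alias("weekend_joins"))
```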
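
A minimal sketch of the join patterns in item 4 (broadcast join, key salting for skew, bucketed join); the table names, bucket count, and salt factor are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("efficient-joins").getOrCreate()
events = spark.read.parquet("s3://bucket/curated/events/")  # large side (hypothetical)
users = spark.read.parquet("s3://bucket/curated/users/")    # small side (hypothetical)

# Broadcast join: ship the small table to every executor so the large table
# is never shuffled.
joined = events.join(F.broadcast(users), "user_id")

# Salting: spread hot join keys across several shuffle partitions by adding a
# random suffix on the large side and replicating the small side per suffix.
SALT = 8
salted_events = events.withColumn("salt", (F.rand() * SALT).cast("int"))
salt_values = spark.range(SALT).select(F.col("id").cast("int").alias("salt"))
salted_users = users.crossJoin(salt_values)
skew_joined = salted_events.join(salted_users, ["user_id", "salt"])

# Bucketed join: pre-bucket both tables on the join key so matching buckets
# can be joined later without a shuffle.
(events.write.mode("overwrite")
       .bucketBy(64, "user_id").sortBy("user_id")
       .saveAsTable("events_bucketed"))
(users.write.mode("overwrite")
      .bucketBy(64, "user_id").sortBy("user_id")
      .saveAsTable("users_bucketed"))
bucket_joined = spark.table("events_bucketed").join(
    spark.table("users_bucketed"), "user_id")
```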

Prerequisites:

This talk is directed towards data engineers, data scientists, and machine learning practitioners who are responsible for building, optimizing, and maintaining large-scale data processing pipelines and ML workflows. These professionals typically have a working knowledge of distributed computing and some experience with PySpark.

Speaker Info:

Unmesh is a Senior Machine Learning Scientist at Dream11, where he builds large-scale machine learning pipelines for the personalization team. With more than six years of experience in developing and deploying machine learning models at scale, Unmesh has worked extensively in the areas of fantasy gaming, forecasting, and finance. He holds a master's degree in Operations Research from Cornell University and an undergraduate degree from IIT Bombay.

Speaker Links:

LinkedIn: https://www.linkedin.com/in/unmesh-padalkar?trk=contact-info

GitHub: https://github.com/UnmeshP

Section: Python in Platform Engineering and Developer Operations
Type: Talk
Target Audience: Intermediate
Last Updated: