Speed up Pandas with Modin
Raj Rakesh (~raj84)
Accelerate your pandas workflows by changing one line of code.
Pandas is a library that needs no introduction in the field of data science. It provides high-performance, easy-to-use data structures and data analysis tools. However, when working with very large amounts of data, Pandas running on a single core becomes insufficient, and people have to resort to distributed systems to get the performance they need.
That improved performance, however, comes with a steep learning curve. Most users just want Pandas to run faster and are not looking to optimize their workflows for their particular hardware setup. They want to use the same Pandas script for a 10KB dataset as for a 10TB dataset.
Modin addresses this by optimizing pandas so that data scientists spend their time extracting value from their data rather than on tools that extract data.
In this talk we will learn how to use Ray to scale new and existing Python code, covering the Ray system architecture, example applications, GPU support, and best practices. We then cover the underlying architecture of Modin and how it scales Pandas DataFrames to huge datasets. Finally, we describe how Modin can be used in your existing scripts with a one-line code change and which Pandas DataFrame APIs Modin currently covers.
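As a rough illustration of the one-line change, the sketch below swaps the pandas import for Modin's drop-in module and leaves the rest of the script untouched. The toy data, the column names, and the fallback to plain pandas when Modin is not installed are assumptions made for this example only. They are not part of the talk material.

```python
# The one-line change: import Modin's pandas-compatible module under the usual alias.
# The ImportError fallback is an assumption added so this sketch runs without Modin.
try:
    import modin.pandas as pd  # needs an engine, e.g. pip install "modin[ray]"
except ImportError:
    import pandas as pd

# Hypothetical toy data; a real workload would more likely start from pd.read_csv(...).
df = pd.DataFrame({
    "sensor": ["a", "b", "a", "b"],
    "reading": [1.0, 2.0, 3.0, 4.0],
})

# Ordinary pandas API from here on; the import swap does not change this code.
totals = df.groupby("sensor")["reading"].sum()
print(totals)
```

On a dataset this small, plain pandas will likely be faster, since Modin has to start its execution engine. Any benefit is expected on data large enough to keep several cores busy.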
- Challenges of Handling Big Data with Pandas
- What is Modin?
- Modin Architecture
- Scaling Pandas DataFrames
- Improvements in Run-time & Efficiency
- Performance Comparison
Target Audience:
- Data Science Professionals
- Data Engineering Developers
- Big Data Developers
- Anyone with prior exposure to Pandas
This talk is inspired by research at the RISELab at UC Berkeley to improve Pandas performance for heavy workloads.
Raj is a Solution Architect - IoT Cloud Platforms at Hitachi Consulting with over 7 years of industry experience in Data Engineering and Data Science. He holds 4 Google Cloud Professional certifications and is passionate about data. He has worked extensively in Data Engineering across different Big Data processing frameworks and now works on public clouds. His favorite languages for Data Engineering and Data Science work are Python and Go.