Hands-on exercises to experience how compiler technology can speed up data processing in Python

Sourav Saha (~sourav0)



Description:

Research suggests that data scientists spend about 45% of their time on data preparation tasks, including loading and cleaning data. Pandas is one of the most popular dataframe libraries because of its diverse utilities and large community support, but as a program grows in size and the data volume increases, its performance degrades. The key factors behind these performance issues are its single-core execution model, the Python-based implementation of most dataframe operations, and its inefficient data structures. Although these are major drawbacks at its core, a pandas program can still be optimized if it is written very carefully. Programs often perform operations that consume a significant amount of execution time and memory yet could be avoided for the end result we are interested in. Did you know that a data processing application written with a pandas-like Python library can be optimized in much the same way a C/C++ compiler optimizes a C/C++ application? In this workshop, I will introduce technologies and concepts that help you think efficiently when developing performance-critical applications and save a significant amount of execution time and memory. The outline of the session is as follows:

  • introduce some intricate performance issues that frequently occur in pandas programs. (15 mins)
  • show practical use-cases to demonstrate how writing a pandas application efficiently can save a significant amount of computation time. (25 mins)
  • explain how compiler technology can be used to apply the same optimizations automatically. (15 mins)
  • explain the lazy execution model and its pros and cons. (5 mins)
  • introduce FireDucks, the compiler-accelerated Python dataframe library that we have developed at NEC R&D Lab to automate the optimization strategies explained above. (10 mins)
  • explain the features of FireDucks that make it suitable for speeding up your data processing tasks at zero cost. (30 mins)
  • explain profiling strategies to find the performance bottlenecks in your existing program when it is executed using FireDucks. (15 mins)
  • demonstrate how FireDucks can outperform other high-performance dataframe libraries (such as Polars and Modin). (5 mins)
  • live coding with an example use-case to experience the speedup on the spot. (45 mins)
  • Q/A and other technical discussion. (15 mins)
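
To give a flavour of the compiler and lazy-execution items in the outline, here is a minimal sketch in plain Python. It is a toy model only and makes no claim about how FireDucks works internally: the `LazyTable` class, its methods, and the "rows sorted" counter are all hypothetical names invented for this illustration. Operations are recorded instead of executed, and at evaluation time the recorded plan is reordered so that filters run before sorts, which is the same rewrite a careful programmer would do by hand.

```python
from dataclasses import dataclass


@dataclass
class LazyTable:
    """Toy lazy table (hypothetical, for illustration only)."""
    rows: list        # list of dicts, one dict per row
    plan: tuple = ()  # recorded operations, not yet executed

    def sort(self, column):
        # Record the sort; nothing is executed yet.
        return LazyTable(self.rows, self.plan + (("sort", column),))

    def filter_gt(self, column, threshold):
        # Record a "column > threshold" filter; nothing is executed yet.
        return LazyTable(self.rows, self.plan + (("filter", column, threshold),))

    def optimized_plan(self):
        # Filters keep row order and Python's sort is stable, so running the
        # filters first yields the same result while sorting fewer rows.
        filters = tuple(op for op in self.plan if op[0] == "filter")
        others = tuple(op for op in self.plan if op[0] != "filter")
        return filters + others

    def collect(self, optimize=True):
        # Execute the plan; also report how many rows were fed into sorts.
        plan = self.optimized_plan() if optimize else self.plan
        rows, rows_sorted = list(self.rows), 0
        for op in plan:
            if op[0] == "filter":
                rows = [r for r in rows if r[op[1]] > op[2]]
            else:  # "sort"
                rows_sorted += len(rows)
                rows = sorted(rows, key=lambda r: r[op[1]])
        return rows, rows_sorted


data = [{"id": i, "score": (i * 37) % 101} for i in range(1000)]
query = LazyTable(data).sort("score").filter_gt("id", 989)

as_written, sorted_as_written = query.collect(optimize=False)
reordered, sorted_reordered = query.collect(optimize=True)
print(sorted_as_written, sorted_reordered, as_written == reordered)
```

The sketch also hints at the trade-off: because the whole plan is visible before anything runs, it can be rewritten, but the work (and any error in it) only surfaces when `collect()` is called.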

FYI: FireDucks is freely available on pypi.org under the 3-Clause BSD License and can be simply installed using pip on a Linux-based system.
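
As a minimal sketch of what trying it out might look like, the snippet below imports FireDucks through its pandas-compatible module and falls back to plain pandas when FireDucks is not installed. The sample data and variable names are made up for illustration, and the fallback is an assumption added here so the snippet runs anywhere pandas is available.

```python
try:
    # Assumption: FireDucks has been installed with pip on a Linux system.
    import fireducks.pandas as pd
    backend = "fireducks"
except ImportError:
    # Fallback so the sketch still runs where FireDucks is unavailable.
    import pandas as pd
    backend = "pandas"

# Made-up sample data for illustration.
df = pd.DataFrame({
    "city": ["Tokyo", "Pune", "Tokyo", "Delhi", "Pune"],
    "sales": [10, 20, 30, 15, 5],
})

# The rest of the program is ordinary pandas-style code.
totals = df.groupby("city")["sales"].sum().sort_values(ascending=False)
top_city = totals.index[0]
top_sales = int(totals.iloc[0])
print(backend, top_city, top_sales)
```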

Prerequisites:

Anyone with a basic understanding of data analysis/preparation with pandas.
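
As a rough gauge of the expected level, attendees should be comfortable reading a snippet like the one below. It is an illustrative sketch, not material taken from the workshop: the data is randomly generated and the function names `naive` and `optimized` are hypothetical. Both functions return the same five rows, but the second one selects the needed rows and columns before sorting, so the sort touches far less data.

```python
import numpy as np
import pandas as pd

# Randomly generated sample data (fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "a": rng.integers(0, 100, n),
    "b": rng.random(n),
    "c": rng.random(n),
    "d": rng.integers(0, 10, n).astype(str),
})


def naive(frame):
    # Sorts every row and every column, then throws most of it away.
    s = frame.sort_values("b")
    s = s[s["a"] > 90]
    return s[["a", "b"]].head(5)


def optimized(frame):
    # Keeps only the needed rows and columns first, then sorts the remainder.
    small = frame.loc[frame["a"] > 90, ["a", "b"]]
    return small.sort_values("b").head(5)


same = naive(df).equals(optimized(df))
print(same)
```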

Video URL:

https://www.loom.com/share/c84bd551f2ed4c28997ef3ca6a669641?sid=dffd8ccb-8e53-43b4-927c-898545d75281

Content URLs:

  • Homepage: https://fireducks-dev.github.io
  • Blogs: https://fireducks-dev.github.io/posts
  • Twitter: https://twitter.com/fireducksdev
  • Sample Presentation: https://fireducks-dev.github.io/files/20240712_FireDucksIntro_Sacpy.pdf

Speaker Info:

Hello, my name is Sourav Saha. I have 11+ years of professional experience at NEC Corporation in the diverse fields of High-Performance Computing, Distributed Programming, Compiler Design, and Data Science. Currently, my team at NEC R&D Lab, Japan, is researching various data processing algorithms. By blending niche technologies from compiler frameworks, high-performance computing, and multi-threaded programming, we have developed a Python library named FireDucks with highly pandas-compatible APIs for DataFrame operations. In my previous engagements, I have worked on the research and development of performance-critical AI and Big Data solutions, and on the optimization of several legacy applications written in C++ and Fortran for weather prediction, earthquake simulation, etc. I have spoken at several meetups and technical conferences related to HPC and Data Science. Looking forward to interacting with you at PyCon India this year!

Speaker Links:

  • E-mail: sourav-saha@nec.com
  • Medium blogs: https://medium.com/@qsourav
  • Qiita blogs: https://qiita.com/qsourav
  • SacPy Talk: https://www.youtube.com/watch?v=CxSgl4ZXhzE&t=1572s
  • LinkedIn: https://www.linkedin.com/in/sourav-%E3%82%BD%E3%82%A6%E3%83%A9%E3%83%96-saha-%E3%82%B5%E3%83%8F-a5750259/

Section: Python in Platform Engineering and Developer Operations
Type: Workshops
Target Audience: Intermediate
Last Updated: