Introducing FireDucks: a must-have DataFrame library to accelerate your voluminous data analysis with pandas at zero cost

Sourav Saha (~sourav0)



Description:

Data preparation is often said to be the most important and time-consuming task in the CRISP-DM (Cross-Industry Standard Process for Data Mining) process. Traditional tools such as pandas have long been the linchpin of this step, offering powerful capabilities. However, pandas provides many ways to write the same operation, and users often end up choosing an inefficient one. As a result, as the program grows in size and the data grows in volume and complexity, execution takes longer and delays the overall analysis. Mitigating this calls for high-performance DataFrame libraries, but the available ones compel a user either to learn completely new APIs (incurring migration cost) or to switch to a more efficient computational system (incurring hardware cost).

We introduce FireDucks, a high-performance DataFrame library that comes with a multithreaded backend written in C++, a JIT compiler that auto-detects and optimizes the performance issues in a user program, and a pure-Python frontend that is highly compatible with pandas. This allows a pandas user to experience significant speedups (sometimes more than 100x), even on CPU-only systems, without any migration cost (manual code changes). With its highly pandas-compatible APIs and revved-up performance, FireDucks can serve the demands of this digital age and turn the arduous task of data wrangling into a more efficient and less taxing endeavor.

In this talk, we will explain:

  • some intricate performance issues that frequently occur in a pandas program.
  • how the choice and execution order of API calls in an application (not limited to pandas) impact its performance.
  • the optimization strategies used in FireDucks to auto-detect and optimize the existing performance issues in a pandas application without any manual intervention.
  • a comparative analysis of FireDucks and other existing high-performance pandas alternatives in terms of runtime performance and API compatibility.
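
To make the point about execution order concrete, here is a minimal sketch in plain pandas. It is not taken from the talk material: the sample data and the function names `sort_then_filter` and `filter_then_sort` are made up for illustration. Both functions return the same rows, but the first sorts the whole table before discarding most of it, while the second narrows the rows and columns first so that the sort touches far less data.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Kochi", "Delhi", "Pune"],
    "sales": [30, 10, 50, 20, 40, 60],
    "notes": ["a", "b", "c", "d", "e", "f"],  # column the result never uses
})

def sort_then_filter(frame):
    # Sorts every row and carries every column, then throws most of it away.
    ordered = frame.sort_values("sales")
    return ordered[ordered["city"] == "Pune"][["city", "sales"]]

def filter_then_sort(frame):
    # Keeps only the needed rows and columns first, then sorts the small subset.
    subset = frame.loc[frame["city"] == "Pune", ["city", "sales"]]
    return subset.sort_values("sales")

slow = sort_then_filter(df)
fast = filter_then_sort(df)

# Both orderings produce the same result; only the amount of work differs.
same_result = slow.equals(fast)
print(same_result)
print(fast)
```

On a table this small the difference is not measurable; it only becomes visible as the number of rows and unused columns grows. Reordering of this kind is the sort of rewrite an optimizing layer could in principle apply on the user's behalf, though the sketch says nothing about how FireDucks itself does it.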

FireDucks is freely available on pypi.org under the 3-Clause BSD License and can be installed with pip on a Linux-based system.
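
As a rough sketch of what "no manual code changes" can look like in practice, the snippet below swaps only the import line. It assumes the package is installed under the name `fireducks` (for example with `pip install fireducks`) and that it exposes a pandas-compatible module as `fireducks.pandas`; the fallback to regular pandas and the `backend` variable are additions for this illustration, so the snippet also runs where FireDucks is not installed.

```python
# Assumption: installed beforehand in a shell with `pip install fireducks`.
try:
    import fireducks.pandas as pd  # assumed drop-in replacement module
    backend = "fireducks"
except ImportError:
    import pandas as pd  # fallback added for this illustration only
    backend = "pandas"

# The rest of the program is ordinary pandas code and stays unchanged.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
totals = df.groupby("key")["value"].sum()

print(backend)
print(totals.to_dict())
```

The project documentation also describes an import hook that runs an existing script without editing it at all (along the lines of `python3 -m fireducks.pandas your_script.py`); check the homepage for the exact invocation supported by your version.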

Presenting a small demo:

Find a small demo!

Prerequisites:

Anyone with a basic understanding of data analysis/preparation with pandas.

Video URL:

https://www.loom.com/share/c84bd551f2ed4c28997ef3ca6a669641?sid=dffd8ccb-8e53-43b4-927c-898545d75281

Content URLs:

  • Homepage: https://fireducks-dev.github.io
  • Blogs: https://fireducks-dev.github.io/posts
  • Twitter: https://twitter.com/fireducksdev
  • Sample Presentation: https://fireducks-dev.github.io/files/20240712_FireDucksIntro_Sacpy.pdf

Speaker Info:

Hello, my name is Sourav Saha. I have 11+ years of professional experience at NEC Corporation in the diverse fields of High-Performance Computing, Distributed Programming, Compiler Design, and Data Science. Currently, my team at NEC R&D Lab, Japan, is researching various algorithms related to data processing. Blending niche technologies from compiler frameworks, high-performance computing, and multi-threaded programming, we have developed a Python library named FireDucks with highly pandas-compatible APIs for DataFrame-related operations. In my previous engagements, I worked on the research and development of performance-critical AI and Big Data solutions, and on the optimization of several legacy applications written in C++ and Fortran, such as weather prediction and earthquake simulation. I have spoken at several meetups and technical conferences related to HPC and Data Science. Looking forward to interacting with you at PyCon India this year!

Speaker Links:

  • E-mail: sourav-saha@nec.com
  • Medium blogs: https://medium.com/@qsourav
  • Qiita blogs: https://qiita.com/qsourav
  • SacPy Talk: https://www.youtube.com/watch?v=CxSgl4ZXhzE&t=1572s
  • LinkedIn: https://www.linkedin.com/in/sourav-%E3%82%BD%E3%82%A6%E3%83%A9%E3%83%96-saha-%E3%82%B5%E3%83%8F-a5750259/

Section: Python in Platform Engineering and Developer Operations
Type: Talk
Target Audience: Intermediate
Last Updated: