Pandas vs Polars: The Evolution of Data Manipulation in Python

Mradul Jain (~mradul)


Description:

Abstract:

The tools we use in data science largely determine how efficient and scalable our solutions are. Pandas has long been the go-to library for data manipulation in Python. Polars, a newer high-performance DataFrame library, challenges that position with faster execution and better memory efficiency. This presentation compares the two, covering their performance, underlying architectures, and suitability for different data tasks. Drawing on recent benchmarks, we'll explore how Polars, built on Rust, offers significant speed advantages, especially for big data operations. The release of Pandas 2.0 has shifted this comparison further. Join us as we work through this performance contest and ask which library is better placed to carry data manipulation in Python forward.

Key Takeaways:

  • Deep Dive into Performance Metrics: Understand how Polars outperforms Pandas across a range of data manipulation tasks. See reported benchmark results, such as Polars being 34 times faster than Pandas for the 'sum()' function and 10 times faster for the 'apply()' function.
  • Insights into Underlying Architectures: Learn what gives Polars its edge: its foundation in Rust, its use of SIMD (Single Instruction, Multiple Data) operations, and lazy evaluation. Understand how these elements together contribute to Polars' speed and efficiency.
  • Strategic Use Cases: Learn how to decide which library to use for your projects. Understand the strengths and weaknesses of both Pandas and Polars, and why Polars is emerging as the go-to choice for big data projects while Pandas remains dominant for smaller datasets.

Prerequisites:

Familiarity with Python programming and the Pandas package.

Speaker Info:

I am a Data Science professional with over 11 years of experience in Machine Learning, Deep Learning, Analytics Product Development, and the Big Data Stack. I have a postgraduate degree in Business Analytics from SCMHRD in Pune and a B.E. from PESIT in Bangalore.

Throughout my career, I have had the opportunity to work with Fortune 500 companies such as IBM and Genpact, as well as leading Australian retailer Kmart and startup Culture Machine. Currently, I am the leader of the Data Science team at AB-InBev, the largest beer company in the world.

I have a strong track record in developing and managing advanced data science teams, providing technical mentoring, and effectively managing stakeholders to drive value through analytics products.

In recognition of my contributions, I've been honored with four national accolades in Analytics Competitions and was bestowed with the Best AI Leader Award at the Future of Finance Summit.

Additionally, I had the privilege to be a speaker at the Fifth Elephant Conference, where I delved into the nuances of the ML lifecycle in credit risk modeling, specifically tailored for CPG.

Speaker Links:

Speaker Linkedin Profile - https://www.linkedin.com/in/mraduljain1/

Section: Data Science, AI & ML
Type: Talks
Target Audience: Intermediate
Last Updated: