All Them Data Engines: Pandas, Spark, Dask, Polars and more - Data Munging with Python circa 2023.

Description:

Introduction

Versatility. / ˌvɜr səˈtɪl ɪ ti / noun: ability to adapt or be adapted to many different functions or activities.

Often our ecosystems limit us to the one technology stack/framework/solution that we end up working with day-to-day. Maybe the framework was chosen for us, maybe it's the one available at hand, maybe it's the skill most prevalent in the team, maybe it was chosen through a decision-analysis process, or maybe other vagaries of the workplace were in play.

This is incredibly limiting: it hampers our ability to develop an intuition for problem solving, to explore the possibilities, and simply to use the right tool for the right job.

In trying to gain experience with a new framework on our own, we are inundated with so many concepts, so much jargon and "technical evangelism", that getting to the practical stuff often becomes an uphill battle for most of us.

This workshop aims to address this fundamental issue:
1. Get hands-on experience across some of the most in-demand data engineering frameworks around today - Pandas, Spark, Dask, Polars etc.
2. Focus on one core thing - data munging: shaping data, analyzing it and deriving insights.

In this interactive 3-hour workshop, fellow data engineers will explore and gain practical experience with some of the industry's most sought-after data engineering frameworks. Through a series of engaging exercises and realistic examples, attendees will be empowered to tackle data engineering challenges efficiently and effectively.

Prerequisites: Low barrier to entry

The workshop uses Jupyter Notebooks (via a local Anaconda installation or Google Colab), GitHub (for collaboration, questions, discussions etc.) and popular datasets (everyone likes movies - we use the MovieLens dataset here for several practical exercises).
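
For example, loading MovieLens into a notebook takes one line per file. A minimal sketch - the file names below are from the public ml-latest-small download, and the paths in the workshop notebooks may differ:

    import pandas as pd

    # MovieLens ships as plain CSV files (ml-latest-small bundle)
    ratings = pd.read_csv("ml-latest-small/ratings.csv")  # userId, movieId, rating, timestamp
    movies = pd.read_csv("ml-latest-small/movies.csv")    # movieId, title, genres

    print(ratings.shape, movies.shape)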

You should have basic familiarity with Python and Jupyter.

Either have a working installation of Anaconda (or any of its flavors - Miniconda, Mamba, others) or have access to Google Colab. While you can also use Binder, I have not tested the notebooks on it.

Talk Outline and Approach

The full workshop (a work in progress at the time of this proposal submission - openly available on GitHub) is built around "Problem Sets": practical questions that we'll ask about the data, and then try to answer using the data engineering frameworks.
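
As a hypothetical example of such a problem-set question (the actual questions live in the repo): "Which movies have the highest average rating among those with at least 100 ratings?" In Pandas, one answer might look like this sketch:

    import pandas as pd

    ratings = pd.read_csv("ml-latest-small/ratings.csv")
    movies = pd.read_csv("ml-latest-small/movies.csv")

    # per-movie average rating and rating count
    stats = (
        ratings.groupby("movieId")["rating"]
        .agg(avg_rating="mean", num_ratings="count")
        .reset_index()
    )

    # keep broadly watched movies, attach titles, rank by average rating
    top10 = (
        stats[stats["num_ratings"] >= 100]
        .merge(movies[["movieId", "title"]], on="movieId")
        .sort_values("avg_rating", ascending=False)
        .head(10)
    )
    print(top10)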

The workshop will have the following sections, 25 to 30 minutes each:
1. Pandas
2. Spark
3. Dask
4. Polars
5. (Optionally, if time permits) - Apache Arrow DataFusion and Ray

This will ensure we cover a wide gamut of ways to think about distributed computing problems, along with the different strengths and weaknesses of each of these data engineering systems. The aim is to build enough comfort and familiarity with the various systems that investigating further, or picking up a new system in the future, becomes far easier than it is today.
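
To make that concrete, here is how the same "average rating per movie" aggregation reads in each engine - a sketch using the public APIs as of 2023:

    # Pandas: eager, single-machine, in-memory
    import pandas as pd
    pd_result = (
        pd.read_csv("ml-latest-small/ratings.csv")
        .groupby("movieId")["rating"].mean()
    )

    # Polars: builds a lazy query plan, runs on collect()
    # (newer Polars versions use group_by; older ones used groupby)
    import polars as pl
    pl_result = (
        pl.scan_csv("ml-latest-small/ratings.csv")
        .group_by("movieId")
        .agg(pl.col("rating").mean())
        .collect()
    )

    # Dask: partitioned pandas DataFrames, runs on compute()
    import dask.dataframe as dd
    dd_result = (
        dd.read_csv("ml-latest-small/ratings.csv")
        .groupby("movieId")["rating"].mean()
        .compute()
    )

    # PySpark: distributed DataFrame driven by a SparkSession
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    spark = SparkSession.builder.appName("munging").getOrCreate()
    spark_result = (
        spark.read.csv("ml-latest-small/ratings.csv", header=True, inferSchema=True)
        .groupBy("movieId").agg(F.avg("rating"))
        .collect()
    )

The code shape is nearly identical across engines, but the execution models differ: Pandas runs eagerly in memory, Polars and Dask defer work until collect()/compute(), and Spark distributes it across a cluster - exactly the kind of contrast the sections above will draw out.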

Continually Evolving

The intent is to keep this workshop evolving, keeping pace with the fast-changing data engineering field.

The open GitHub repo will carry a full list of references, as well as sidebars that capture lessons, big and small, that we encounter while solving the problems.

While the session will be over in 3 hours, the collaboration can continue: we can all fork the repo, contribute PRs, and submit questions and issues. My aim is to keep adding notebooks for emergent data engineering systems that I feel may be important, so we'll all have a resource on GitHub that continues to generate value long after our 3-hour session is over.

Takeaways

By the end of this workshop, participants will understand the core concepts behind these frameworks and their applications. They will be equipped with practical skills and techniques to efficiently manipulate, process and analyze data using Pandas, Spark, Dask, Polars and others. The most important takeaway? Valuable insight into selecting the right framework for the right job.

Prerequisites:

  • Working know-how of Python, some familiarity with GitHub
  • Working installation of Anaconda or any similar distribution (preferably public, open and free), or access to Google Colab
  • Optionally - working installations of Pandas, PySpark, Dask and Polars, to ensure minimal time is lost on set-up tasks (see the install sketch after this list). The notebooks on GitHub already carry installation instructions.
  • A working internet connection, so we can access the data and collaborate on GitHub
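
For those setting up locally, installing the four core libraries is typically a single command - a sketch; see the repo's notebooks for exact versions and extras, and note that PySpark additionally requires a Java runtime:

    pip install pandas pyspark "dask[dataframe]" polars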

Content URLs:

  • Github repo for the workshop (Jupyter Notebooks): https://github.com/shauryashaurya/learn-data-munging

Speaker Info:

Shaurya Agarwal, Deputy Head - Engineering at Barnes and Noble (BNED LoudCloud).

With 20+ years of experience in Analytics & ML, Big Data and Cloud Computing, Shaurya leads the engineering teams at BNED that are building the next generation of data products for the company.

Speaker Links:

  • Github: https://github.com/shauryashaurya
  • LinkedIn: https://www.linkedin.com/in/shauryashaurya/
  • Twitter: @shauryashaurya (https://twitter.com/shauryashaurya)
  • Talks:
    • Recent panel discussion at DataOps Observability Conf 2023: https://www.youtube.com/live/GM1EzNChtdk?feature=share&t=2884
    • Panel discussion at Ashnik's Data Pipeline & Observability Insights conference 2022 (https://www.ashnik.com/events/bfsi-datapipeline-and-observability-platform-event-mumbai-india/). This was an invite-only event (CxOs and leadership from some of India's largest banks and open-source technology companies were in attendance), but there's a video with highlights: https://youtu.be/mAulCd-XJLU?t=352

Section: Data Science, AI & ML
Type: Workshops
Target Audience: Intermediate