Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets

pavithraes

Description:

While most folks aren't at the scale of cloud giants or black hole research teams that analyze petabytes of data every day, you can easily find yourself in a situation where your laptop doesn't have quite enough power for the analysis you need.

"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitudes larger than what can fit into a typical laptop's memory.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets, with real-world examples run on powerful machines in a public cloud, starting from how the data is stored and read and ending with how it is processed and visualized.

You will understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.

By the end, you will be able to answer:

  • What makes some data formats more efficient than others at scale?
  • Why, how, and when (and when not) should you use parallel and distributed computation (primarily with Dask)?
  • How do you manage cloud storage, resources, and costs effectively?
  • How can interactive visualization (primarily with hvPlot) make large and complex data more understandable?
  • How can you collaborate comfortably on data science projects with your entire team?

The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are great at handling large data. It includes plenty of exercises to help you build a foundational understanding within three hours.

🪄 Side note: All participants will get access to cloud resources for the tutorial through Nebari, an open source JupyterHub distribution. We will share details on how to sign up for the platform before the tutorial, but participants can also sign up on the spot. Note that you can sign up anonymously and do not need to share any personally identifiable information.


Workshop outline 🎯

Introduction [15 mins]

  • Motivating example (live dashboard that participants will build by the end of the tutorial)
  • Get participants set up with tutorial material
  • Create an initial data science environment, with a quick introduction to conda
  • Tutorial overview

Data storage and formats [15 mins]

  • Reading and comparing CSVs and Parquet files with hands-on examples (see the sketch after this list)
  • Understanding the Parquet format
  • Brief discussion on best practices around cloud storage for big data
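
To make the comparison concrete, here is a minimal sketch of the kind of hands-on example this segment builds on; the file names, column names, and timing helper are illustrative placeholders, not the tutorial's actual dataset.

```python
# Minimal sketch: comparing CSV and Parquet reads with pandas.
# "trips.csv" / "trips.parquet" and the column names are hypothetical.
import time

import pandas as pd


def timed_read(read_fn, path, **kwargs):
    """Read a file and report the wall-clock time it took."""
    start = time.perf_counter()
    df = read_fn(path, **kwargs)
    print(f"{path}: {time.perf_counter() - start:.2f}s, {len(df):,} rows")
    return df


# CSV is row-oriented plain text: every value is parsed on every read.
df_csv = timed_read(pd.read_csv, "trips.csv")

# Parquet is a compressed, columnar, binary format with an embedded schema,
# so reads are typically much faster and data types survive the round trip.
df_parquet = timed_read(pd.read_parquet, "trips.parquet")

# Column pruning: only the requested columns are read from disk.
subset = timed_read(pd.read_parquet, "trips.parquet",
                    columns=["pickup_datetime", "fare_amount"])
```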

Analyze a subset of data [15 mins]

  • Introduce the dataset
  • Read and explore the data, then create a processing workflow with pandas (see the sketch after this list)
  • Exercises so participants can familiarize themselves with the dataset and manipulate it comfortably
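
As a rough illustration of the pandas workflow this segment builds up (the file name, column names, and aggregation below are placeholders, not the tutorial's actual material):

```python
# Minimal sketch of a subset-sized pandas workflow; file and column names are placeholders.
import pandas as pd

# Start with a subset that fits comfortably in memory, e.g. one month of data.
df = pd.read_parquet("trips_2023_01.parquet")

# Explore the data.
print(df.head())
print(df.dtypes)
print(df.describe())

# A small processing workflow: clean, derive a column, aggregate.
df = df[df["fare_amount"] > 0]  # drop obviously bad records
df["trip_minutes"] = (
    df["dropoff_datetime"] - df["pickup_datetime"]
).dt.total_seconds() / 60
daily_mean_fare = df.groupby(df["pickup_datetime"].dt.date)["fare_amount"].mean()
print(daily_mean_fare.head())
```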

Visualize a subset of data [20 mins]

  • Introduction to interactive visualization (and the Python viz landscape)
  • Create an interactive visualization with hvPlot and Bokeh (see the sketch after this list)
  • Exercises that highlight best practices for creating interactive visualizations (participants have some options so that we have a range of different visualizations at the end)
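
A minimal sketch of what the hvPlot step can look like, assuming the same placeholder file and column names as above; hvPlot mirrors the pandas .plot API but renders interactive Bokeh plots:

```python
# Minimal sketch: an interactive Bokeh scatter plot via hvPlot.
import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor on pandas objects)
import pandas as pd

df = pd.read_parquet("trips_2023_01.parquet")  # placeholder file

plot = df.hvplot.scatter(
    x="trip_distance",
    y="fare_amount",
    alpha=0.2,      # de-emphasize overlapping points
    width=700,
    height=400,
)
plot  # in a Jupyter notebook, the last expression renders with pan/zoom/hover tools
```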

Break [10 mins]

Analyze full dataset [45 mins]

  • Introduction to parallel and distributed computing
  • Adapt the pandas workflow to use Dask and Dask Gateway (see the sketch after this list)
  • Exercises that highlight differences compared to pandas and provide scale-friendly alternatives
  • Deep-dive into Dask’s diagnostic dashboard plots
  • Exercises to understand performance and memory issues using the diagnostic dashboard
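
A minimal sketch of how the pandas workflow above might be adapted to Dask with a Dask Gateway cluster; the cluster sizing, file glob, and column names are placeholders, and the exact setup on the tutorial platform may differ:

```python
# Minimal sketch: the same workflow, expressed with Dask on a Dask Gateway cluster.
import dask.dataframe as dd
from dask_gateway import Gateway

# Request a cluster from Dask Gateway and connect a client to it.
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(4)                      # ask for 4 workers (placeholder sizing)
client = cluster.get_client()
print(client.dashboard_link)          # URL of the diagnostic dashboard

# Same DataFrame API as pandas, but lazy and partitioned across the workers.
ddf = dd.read_parquet("trips_2023_*.parquet")
ddf = ddf[ddf["fare_amount"] > 0]
daily_mean_fare = ddf.groupby(ddf["pickup_datetime"].dt.date)["fare_amount"].mean()

# Nothing has run yet; .compute() triggers the distributed computation
# and returns a regular pandas object.
result = daily_mean_fare.compute()
```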

Visualize full dataset [20 mins]

  • Adapt the previous visualization to the full dataset (see the sketch after this list)
  • Example to share best practices for better performance and readability
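
For example (a sketch under the same placeholder names; rasterization via Datashader is one common way hvPlot keeps full-dataset plots responsive):

```python
# Minimal sketch: the earlier scatter plot, scaled to the full dataset.
import dask.dataframe as dd
import hvplot.dask  # noqa: F401  (registers the .hvplot accessor on Dask objects)

ddf = dd.read_parquet("trips_2023_*.parquet")

plot = ddf.hvplot.scatter(
    x="trip_distance",
    y="fare_amount",
    rasterize=True,   # pre-aggregate to an image so the browser never receives every point
    cnorm="eq_hist",  # histogram-equalized color scale keeps dense regions readable
    width=700,
    height=400,
)
plot
```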

Break [10 mins]

Collaborative data science [15 mins]

  • Create and share the dashboards with fellow participants on Nebari, a JupyterHub distribution (see the sketch after this list)
  • Create and use a new conda environment for the workflow
  • Best practices for collaborative settings, such as reproducible conda environments
  • Quick walkthrough of deploying Nebari
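
As a rough sketch of the dashboard-sharing step: hvPlot/Bokeh plots are commonly composed and served with Panel (Panel is not named in the outline above, so treat it as an assumption about the tooling); the file and column names are placeholders, and sharing on Nebari follows the platform's own workflow rather than this snippet.

```python
# Minimal sketch: composing plots into a servable dashboard with Panel (assumed tooling).
import hvplot.pandas  # noqa: F401
import pandas as pd
import panel as pn

pn.extension()

df = pd.read_parquet("trips_2023_01.parquet")   # placeholder file
fares = df.hvplot.line(x="pickup_datetime", y="fare_amount")

dashboard = pn.Column(
    "# Trip fares dashboard",   # Markdown heading pane
    fares,
)

# Marking the layout as servable lets `panel serve <notebook-or-script>`
# (or the hosted platform's equivalent) expose it as a standalone web app.
dashboard.servable()
```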

Conclusion [15 mins]

  • Mention other tools for large-scale analysis: xarray, zarr, intake, and more
  • Brief discussion around machine learning and array-based workflows
  • Resources to learn more

Prerequisites:

Participants are expected to have some familiarity with Python programming in a data science context. If you know how to create and import Python functions and have some experience doing exploratory data analysis with pandas or NumPy, you will be able to follow along with the tutorial comfortably.

The tutorial material will be in the form of Jupyter Notebooks, so a basic understanding of the notebook interface is nice to have, but there will be a quick primer on using Jupyter Notebooks at the beginning of the tutorial. If participants want to run the tutorial materials locally (which is not necessary because the material will be hosted on the cloud for them), a fundamental understanding of the command line interface, git-based version control, and packaging tools like pip and conda will be helpful.

Speaker Info:

Pavithra Eswaramoorthy is a Developer Advocate at Quansight, where she works to improve the developer experience and community engagement for several open source projects in the PyData community. Currently, she maintains the Bokeh visualization library, and contributes to the Nebari (adjacent to the Jupyter community) and conda-store (part of the conda ecosystem) projects. Pavithra has been involved in the open source community for over 5 years, notably as a maintainer of the Dask library and an administrator for Wikimedia's OSS programs. In her spare time, she enjoys a good book and hot coffee. :)

Speaker Links:

Pavithra Eswaramoorthy's list of previous talks and GitHub profile.

Section: Cloud Computing
Type: Workshops
Target Audience: Beginner