Collaboration in Data Science: Tools, Challenges, and Best Practices

pavithraes

Description:

The PyData ecosystem, the collection of open source Python libraries for scientific computing and machine learning, is at the heart of the rapidly advancing data science landscape. Thanks to a dedicated focus on interoperability, the PyData community has built a powerful toolkit for individuals that works across many scientific and business domains. However, data science in research and industry is almost always done in teams, and we have yet to ensure a good story for effective collaboration with PyData tools. For example, we have JupyterHub for shared infrastructure, conda for package and environment management, and Dask for distributed computing, but setting up a stable platform with all of these tools requires significant DevOps expertise.
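As a minimal illustration of the environment-management piece, a team's conda environment can be captured in an `environment.yml` file that teammates recreate with `conda env create -f environment.yml`. This is a sketch only; the environment name and the package list below are hypothetical:

```yaml
# environment.yml — a shareable, reproducible environment spec
name: team-analysis
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - jupyterlab
```

Checking this file into version control lets every team member (and the shared JupyterHub) build the same environment, which is one small step toward the reproducibility story discussed in the talk.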

In this talk, we’ll cover the importance of data science collaboration tools and principles at the “infrastructure” level. We’ll look at the current collaboration gaps in our ecosystem, with a focus on the Jupyter and conda tools. We’ll then discuss solutions and curated best practices based on my personal experience navigating these gaps, including how to:

  • Share ongoing work and dashboard visualizations with reproducible data science environments
  • Design for scalability (distributed compute) and productionization from the foundation
  • Manage team resources to minimize cloud costs and ensure reliability
  • Incorporate MLOps and security best practices quickly

My colleagues at Quansight implemented these practices by creating two community open source projects, Nebari and conda-store, and learned numerous lessons along the way. We won’t get into the details of these tools; instead, I’ll share some generalizable, tried-and-tested, and opinionated workflows that have worked well for our team and clients. By the end, I hope you’ll be equipped with the tools and knowledge to promote better collaboration in your data science team. :)

Prerequisites:

A basic understanding of Python-based data science tools (NumPy, pandas, matplotlib, etc.) and workflows (exploratory analysis, visualization, etc.) is required. If you have used Jupyter Notebooks, created environments using the conda package manager, and performed a groupby operation in pandas, you should be able to follow along with the talk comfortably.
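To gauge the expected baseline, the pandas `groupby` operation mentioned above looks like this. The dataset here is a made-up example for illustration:

```python
import pandas as pd

# Hypothetical dataset: cloud costs logged per service by a team.
df = pd.DataFrame({
    "service": ["compute", "storage", "compute", "storage"],
    "cost": [10.0, 2.5, 7.5, 1.5],
})

# Split-apply-combine: total cost per service.
totals = df.groupby("service")["cost"].sum()
print(totals)
```

If reading this feels comfortable, the talk's technical content should be easy to follow.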

Although not necessary, hands-on experience with data workflows, previous experience working in a team, and familiarity with distributed computing principles will help you get the most value out of this talk.

Speaker Info:

Pavithra Eswaramoorthy is a Developer Advocate at Quansight, where she works to improve the developer experience and community engagement for several open source projects in the PyData community. Currently, she maintains the Bokeh visualization library, and contributes to the Nebari (adjacent to the Jupyter community) and conda-store (part of the conda ecosystem) projects. Pavithra has been involved in the open source community for over 5 years, notably as a maintainer of the Dask library and an administrator for Wikimedia’s OSS programs. In her spare time, she enjoys a good book and hot coffee. :)

Speaker Links:

Pavithra Eswaramoorthy's previous talks and GitHub profile.

Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: