Distributed Data Pipelines in Python

Bhavani Ravi (~bhavaniravi)



Description:

The Why

Distributed Data Pipelines, Woah! Woah! Woah! That’s a lot of jargon. Actually, it’s not.

With data being the new oil, every company wants to create its own data lake, and part of the world is chasing ML as the new shiny object. No one really talks about the component (the data pipelines) that makes the data consumable, so here I am.

In an ideal world that would have been enough, but this is the COVID world. With millions of people being infected, imagine the amount of data being generated every day. For the pharma industry, gaining an advantage in this situation means processing the data as and when it’s available and deriving business value from it.

A single streamlined data pipeline may sound cozy, but no one wants to sit through 10 hours to ingest 2 million records, do you? Trust me, you don’t.

The solution? Distributed data pipelines.

The What

  1. Evolution of Data Pipelines - 3 mins
    • In this section, we will cover conventional methods, from cron jobs to ETL pipelines, and their drawbacks given the volume and
      complexity of the data we are dealing with
  2. Data pipelines - The Why - 3 mins
    • In this section, we will compare and contrast conventional methods with data pipeline tools and shed light on the need for distributed data
      pipelines
  3. Data pipelines - The What - 5 mins
    • Components of a data pipeline system
    • Thinking in data pipelines
  4. Distributed data pipeline with Airflow & Kubernetes - The How (Live Coding) - 13 mins
    • What is Airflow
    • How to write a data pipeline using Airflow
    • Making it distributed without hassle
  5. Q & A - 5 mins
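
To give a flavour of what "thinking in data pipelines" could look like, here is a minimal sketch that uses only the Python standard library (Python 3.9+ for `graphlib`). It is not Airflow and not the code from the live demo; the task names (`extract`, `clean`, `count`, `load`) and the `run_pipeline` helper are hypothetical and exist only for illustration. The idea it shows is that a pipeline is a graph of small tasks, and tasks that do not depend on each other can run at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


# Hypothetical tasks. Each receives the results of the tasks that already ran.
def extract(_results):
    return list(range(10))


def clean(results):
    return [x for x in results["extract"] if x % 2 == 0]


def count(results):
    return len(results["extract"])


def load(results):
    return {"rows": results["clean"], "total_seen": results["count"]}


# Task name -> (callable, names of upstream tasks it depends on).
PIPELINE = {
    "extract": (extract, []),
    "clean": (clean, ["extract"]),
    "count": (count, ["extract"]),
    "load": (load, ["clean", "count"]),
}


def run_pipeline(pipeline, max_workers=4):
    """Run tasks in dependency order; independent tasks run in parallel threads."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in pipeline.items()})
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every task whose upstream tasks are finished is ready to run now.
            ready = sorter.get_ready()
            futures = {
                name: pool.submit(pipeline[name][0], dict(results)) for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


if __name__ == "__main__":
    print(run_pipeline(PIPELINE)["load"])
```

Here `clean` and `count` both depend only on `extract`, so they are submitted together. A scheduler such as Airflow applies the same dependency idea, but adds scheduling, retries, and the option to run tasks on separate workers rather than threads in one process.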

The Outcome

For Data Scientists/Researchers - The talk will shed light on what it takes to convert ML models into a consumable, scalable production system, and on how to work hand in hand with system engineers.

For Aspiring Data Scientists - The talk shows how ML is not all about fancy models but about the data and systems around them.

For System Engineers - You will relate to this talk the most because it shows how to move ML models to production.

Prerequisites:

Mandatory

  1. Python

Nice to have

  1. Machine learning
  2. Kubernetes
  3. Docker

Content URLs:

https://bhavaniravi.com/blog/deploying-airflow-on-kubernetes

https://bhavaniravi.com/blog/apache-airflow-introduction

https://www.youtube.com/watch?v=4VBHUB5jLnk

Speaker Info:

Bhavani Ravi works as a Research Engineer at Saama Technologies, converting ML models into consumable backend systems. She is an open-source enthusiast and has contributed to Pandas and Rasa (the chatbot library). She is also an ambassador of WomenTechmakers Chennai and co-organizes meetups for Google Developer Group Chennai.

Speaker Links:

I used to write on Medium, but the paywall was too much, so I moved my blog here

I also create tutorial videos on YouTube

I have contributed code to Pandas

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: