Automating Data Pipeline using Apache Airflow

Mridu Bhatnagar (~mridubhatnagar)


Description:

Machine learning is increasingly used to make predictions and draw insights from data. The first step toward either is an efficient process for collecting data from many different sources. Traditional ways of collecting data are tedious and cumbersome: manually running scripts to extract, transform, and load data costs time.

The data pipeline can be automated to make this process efficient. Scripts that extract data can be scheduled with crontab, but crontab has drawbacks of its own, and monitoring is a major one. This is where Apache Airflow, an open-source tool built by the Airbnb engineering team, helps. Airflow is a platform to programmatically author, schedule, and monitor workflows.
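
To make the extract, transform, and load steps concrete, here is a minimal sketch of the kind of script a cron job might run. It is an illustration under stated assumptions, not material from the talk: the inline CSV data, the function names, and the in-memory store are all hypothetical stand-ins for a real source and database.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

# Hypothetical inline data standing in for a real data source.
RAW_CSV = "city,temp_c\nDelhi,30\nPune,25\n"


def extract(raw):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Convert the temperature column from Celsius to Fahrenheit."""
    return [
        {"city": row["city"], "temp_f": int(row["temp_c"]) * 9 / 5 + 32}
        for row in rows
    ]


def load(rows, store):
    """Append rows to an in-memory list (a stand-in for a database)."""
    store.extend(rows)
    return len(store)


def run_pipeline(raw, store):
    """Run the three steps in order and log progress after each stage."""
    rows = extract(raw)
    log.info("extracted %d rows", len(rows))
    rows = transform(rows)
    count = load(rows, store)
    log.info("loaded %d rows", count)
    return count


if __name__ == "__main__":
    # A crontab entry such as the following (hypothetical path) would run
    # this script every day at 02:00:
    #   0 2 * * * python3 /path/to/etl.py
    run_pipeline(RAW_CSV, [])
```

With cron, the log lines above are the only record of a run. Nothing shows which step failed, and nothing retries it, unless the script implements that itself.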

Prerequisites:

Basic knowledge of Python

Content URLs:

https://docs.google.com/presentation/d/1puwggckL14kb0CXiV0g-iXjm--2bNeor57vCzje278M/edit?usp=sharing

Outline of the Talk

  1. Background [Extract, Transform, Load] - 2 mins
  2. Walk through the traditional approach to automation using a cron job - 3 mins
  3. Explain the shortcomings of cron jobs [logging, monitoring], along with use cases where a cron job is the better choice for automation - 4 mins
  4. Break the title down into distinct words and explain each from scratch: Automation + Data + Pipeline + Apache Airflow - 4-5 mins
  5. Introduction to Apache Airflow. Explain the terminology: Workflow, Operators, Acyclic Graph, Directed Acyclic Graph - 10 mins
  6. Screenshots with an explanation of the Airflow UI, and the shortcomings of Apache Airflow - 5 mins
  7. Airflow Architecture
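
The directed acyclic graph idea from item 5 can be sketched with the Python standard library alone. This is not Airflow's API. It only illustrates how declaring task dependencies determines a valid execution order, and why a cycle makes a workflow impossible to schedule. The task names and helper functions are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """Return True if the graph has no circular dependencies."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(workflow))
    # Two tasks that wait on each other can never be scheduled.
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

A scheduler built on this model can tell which tasks are ready to run, which are blocked, and which failed. That per-task visibility is what a flat crontab entry lacks.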

Speaker Info:

Mridu Bhatnagar is a software development engineer at Goibibo and an organizer of DjangoGirls Indore and PyLadies Delhi. She currently works with Python and Django. When not coding, she loves to experience the outdoors, volunteer as a speaker to share her learnings, and learn from other enthusiasts.

Github Link: https://github.com/mridubhatnagar

Twitter Link: https://twitter.com/Mridu__

Speaker Links:

Past Experience [December 2018 - Present]

PyData Delhi meetup

a. Introduction to APIs - https://github.com/pydatadelhi/talks/issues/81 Talk Video - https://drive.google.com/open?id=1JpAkqHQAKjHtb9sancMIsYUvGBLC4sKX

b. Virtual Environment in Python - https://github.com/pydatadelhi/talks/issues/85

LinuxChix India

a. Tech Journey so far - https://github.com/linuxchixin/talks/issues/65

Linux User Group Delhi [ILUGD]

a. Playing around with APIs - https://github.com/ILUGD/talks/issues/96

b. https://github.com/ILUGD/talks/issues/106

Pyladies Delhi

a. Virtual Environment in Python - https://github.com/PyLadiesDelhi/talks/issues/20

Hackr.io

a. Python for All - https://www.meetup.com/Hackr-Bootcamp/events/260880214/ Video - https://www.facebook.com/hackr.io/videos/2306449919614208/

LetsPy Delhi

a. Small Video - https://www.facebook.com/LetsPyDelhi/videos/342741663026078/

b. https://www.facebook.com/LetsPyDelhi/photos/a.313241799540506/335501603981192/?type=3&theater

DjangoGirls, Bangalore

a. Coach[February, 2019] - https://djangogirls.org/bangalore/

DjangoGirls, Pune

a. Coach[22-06-2019] - https://djangogirls.org/pune/

Women who Go, Delhi + Pyladies Delhi + LinuxChix India combined meetup

a. Understanding HTTP from ground up - https://www.meetup.com/New-Delhi-Women-Who-Go/events/261596323/

Drupal Camp 2019, Delhi

a. Automating data pipelines using Apache Airflow

Blogs

  1. Pybites Blog - https://pybit.es/guest-pybites-blog-tag-analysis-plotly.html
  2. Medium personal blog
     a. Twitter Data Retrieval - https://medium.com/@mridubhatnagar/twitter-data-retrieval-9d5c79870a0f
     b. Word Notifier - https://medium.com/@mridubhatnagar/word-notifier-c5e0d765e56c

Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: