Automating Data Pipeline using Apache Airflow

Mridu Bhatnagar (~mridubhatnagar)


Description:

Machine learning is increasingly used to make predictions and draw insights from data. The first step toward either is an efficient process for collecting data from a variety of sources. Collecting data the traditional way is tedious: manually running scripts to extract, transform and load data costs time that could be spent elsewhere.

To make the process efficient, the data pipeline can be automated. Scripts that extract data can be scheduled with crontab. However, crontab has drawbacks of its own, and a major one is monitoring. This is where Apache Airflow, an open-source tool originally built by the Airbnb engineering team, helps. Airflow is a platform to programmatically author, schedule and monitor workflows.
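
To make the idea of a workflow with dependencies concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API. The task functions, the sample rows and the `run_pipeline` helper are hypothetical names chosen for illustration. It shows extract, transform and load steps run in dependency order, with a per-task status recorded, which is the kind of visibility a plain cron entry does not give.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    # Hypothetical raw rows; a real task would call an API or query a database.
    return [" 3 ", "1", " 2"]


def transform(rows):
    # Clean the raw strings and convert them to sorted integers.
    return sorted(int(row.strip()) for row in rows)


def load(rows, target):
    # Append the cleaned rows to an in-memory stand-in for a warehouse table.
    target.extend(rows)
    return len(rows)


def run_pipeline():
    warehouse = []
    results = {}
    status = {}

    # Each task reads the output of the task it depends on.
    tasks = {
        "extract": lambda: extract(),
        "transform": lambda: transform(results["extract"]),
        "load": lambda: load(results["transform"], warehouse),
    }

    # Map each task to the set of tasks that must finish before it.
    dependencies = {
        "extract": set(),
        "transform": {"extract"},
        "load": {"transform"},
    }

    # static_order() yields the tasks so that dependencies always come first.
    for name in TopologicalSorter(dependencies).static_order():
        try:
            results[name] = tasks[name]()
            status[name] = "success"
        except Exception:
            # Record the failure and stop, since downstream tasks cannot run.
            status[name] = "failed"
            break

    return warehouse, status


if __name__ == "__main__":
    print(run_pipeline())
```

In Airflow, each of these steps would be declared as a task in a DAG, and the scheduler and web UI would take over the ordering, retries and status tracking that this sketch does by hand.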

Prerequisites:

Basic knowledge of Python

Content URLs:

https://docs.google.com/presentation/d/1n510k4_UgUM1UQqqaqJBhPcp82DmORwkWLkwyxnzOmk/edit?usp=sharing

Outline of the Talk

  1. Background [Extract, Transform, Load] - 2 mins
  2. Walk through the traditional approach to automation using a cron job - 3 mins
  3. Explain the shortcomings of using a cron job [logging, monitoring], along with use cases where a cron job is the better choice for automation - 4 mins
  4. Break down the title into distinct words and explain each from scratch: Automation + Data + Pipeline + Apache Airflow - 4-5 mins
  5. Introduction to Apache Airflow. Explain the terminology: Workflow, Operators, Acyclic Graph, Directed Acyclic Graph - 10 mins
  6. Screenshots with an explanation of the UI, and the shortcomings of Apache Airflow - 5 mins

Speaker Info:

I am Mridu Bhatnagar, a computer science and engineering graduate from NIIT University, batch of 2013-2017. I work as a software engineer at Goibibo as part of the Marketing Technology team. On weekends I love to volunteer, attend meetups and share what I learn. The tech stack I primarily work on is Python and its related web frameworks.

Github Link: https://github.com/mridubhatnagar

Twitter Link: https://twitter.com/Mridu__

Speaker Links:

Past Experience [December 2018 - Present]

PyData Delhi meetup

a. Introduction to APIs - https://github.com/pydatadelhi/talks/issues/81 Talk Video - https://drive.google.com/open?id=1JpAkqHQAKjHtb9sancMIsYUvGBLC4sKX

b. Virtual Environment in Python - https://github.com/pydatadelhi/talks/issues/85

LinuxChix India

a. Tech Journey so far - https://github.com/linuxchixin/talks/issues/65

Linux User Group Delhi [ILUGD]

a. Playing around with APIs - https://github.com/ILUGD/talks/issues/96

b. https://github.com/ILUGD/talks/issues/106

Pyladies Delhi

a. Virtual Environment in Python - https://github.com/PyLadiesDelhi/talks/issues/20

Hackr.io

a. Python for All - https://www.meetup.com/Hackr-Bootcamp/events/260880214/ Video - https://www.facebook.com/hackr.io/videos/2306449919614208/

LetsPy Delhi

a. Small Video - https://www.facebook.com/LetsPyDelhi/videos/342741663026078/

b. https://www.facebook.com/LetsPyDelhi/photos/a.313241799540506/335501603981192/?type=3&theater

DjangoGirls, Bangalore

a. Coach [February, 2019] - https://djangogirls.org/bangalore/

DjangoGirls, Pune

a. Coach [22-06-2019] - https://djangogirls.org/pune/

Women who Go, Delhi + Pyladies Delhi + LinuxChix India combined meetup

a. Understanding HTTP from ground up - https://www.meetup.com/New-Delhi-Women-Who-Go/events/261596323/

Drupal Camp 2019, Delhi

a. Automating data pipelines using Apache Airflow

Blogs

  1. Pybites Blog - https://pybit.es/guest-pybites-blog-tag-analysis-plotly.html
  2. Medium personal blog
     a. Twitter Data Retrieval - https://medium.com/@mridubhatnagar/twitter-data-retrieval-9d5c79870a0f
     b. Word Notifier - https://medium.com/@mridubhatnagar/word-notifier-c5e0d765e56c

Id: 1274
Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: