Revamp your DevOps/MLOps game: Apache Airflow to orchestrate smart workflows
LAISHA WADHWA (~laisha77) |
Have you been struggling with running redundant data pipelines and manually transferring files across databases? Have you built ML projects but found it difficult to put them into production? This talk will introduce an amazing workflow orchestrator tool, Airflow, and you'll learn why leading startups are using it to build customized ML workflows at scale!
Airflow can sound more complicated than it is. This talk is all about using it for anything and everything; Airflow doesn't care. The Data Science field involves a lot of data munging and processing before the data is finally used for modeling. A lot of time is invested in streamlining the whole ETL pipeline and consuming data from multiple sources. For building applications at scale and simplifying the process of building complex data workflows, workflow orchestration is vital.
As a Data Engineer, it has personally helped me streamline data-processing pipelines while reducing manual tasks and increasing efficiency. In the age of automation, workflow orchestration is just what every data scientist needs to automate recurring tasks like fetching data periodically, monitoring cron jobs, and much more.
- Background [5 minutes]
- Once upon a time, there was CRON
- What are DAGs?
- Directed Acyclic Graph (DAG) - nodes are tasks and edges are the dependency structure.
- Airflow concepts [10 minutes]
- What is Airflow?
- Programmatically author workflows, stateful scheduling, rich CLI and UI, logging, monitoring, and alerting, modularity that lends itself well to testability, batch processing to solve common problems.
- An analogy between supply chain management and Airflow.
- Why should you be interested in working with it?
- What value does Airflow add?
- Retries tasks elegantly, which handles transient network errors; alerts on failure (email or Slack); re-runs specific tasks in a large DAG; supports distributed execution; awesome OSS community and momentum; host it anywhere: AWS, Azure, or GCP.
- The terms: DAG, task instance, DAG run, and operator
- Airflow operators [8 minutes]
- Python Operator
- Email Operator
- Custom operators to trigger spark jobs
- Web-scraping operators
- All about loving DAGs [5 minutes]
- Some amazing use-cases - ML ETL pipeline automation [7 minutes]
- Demo and examples
- Personal experiences of using it.
- Customizable platform - add custom operators and closing note [5 minutes]
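The DAG idea in the outline above can be sketched in plain Python, with no Airflow installation needed: tasks are nodes, dependencies are edges, and execution follows a topological order. The task names here are illustrative, and the standard-library `graphlib` module stands in for Airflow's own scheduler.

```python
from graphlib import TopologicalSorter

# A tiny ETL DAG: each key depends on the tasks in its set
# (edges point upstream). Task names are made up for illustration.
etl_dag = {
    "transform": {"extract"},
    "train_model": {"transform"},
    "report": {"transform"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(etl_dag).static_order())
print(order)  # "extract" first, "train_model"/"report" last
```

An orchestrator like Airflow does essentially this, plus scheduling, retries, and logging around each node.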
What is Airflow and why use it?
Apache Airflow, a workflow management system developed and open-sourced by Airbnb in 2015, comes in handy not only for writing basic ETL pipelines but also for a plethora of tedious tasks like fetching data periodically, monitoring cron jobs, and much more. The possibilities are endless. It's an easy-to-use Python-based tool that makes all the cumbersome tasks really simple: just create a DAG with a task for each step and start linking them up. In the era of machine and deep learning we are slowly moving towards writing ETL (Extract, Transform, Load) pipelines for all data preparation and preprocessing tasks, but the world is not all about writing ETL pipelines to automate things! There are many other use cases where we have to perform tasks in a certain order, once or periodically. For instance:
1. Monitoring cron jobs
2. Scheduling web scrapers
3. Data transfers across databases
4. Machine learning pipelines
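The "fetching data periodically" use case boils down to computing when a task is next due. Here is a minimal, stdlib-only sketch of an `@daily`-style schedule; the function names are mine, not Airflow's API:

```python
from datetime import datetime, timedelta

def next_daily_run(last_run: datetime) -> datetime:
    """Next midnight strictly after last_run -- an '@daily'-style schedule."""
    midnight = last_run.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)

def is_due(last_run: datetime, now: datetime) -> bool:
    """A scheduler loop would fire the task whenever this is True."""
    return now >= next_daily_run(last_run)

last = datetime(2024, 5, 1, 14, 30)
print(next_daily_run(last))  # 2024-05-02 00:00:00
```

Airflow generalizes this idea to arbitrary cron expressions and keeps the state (which runs have happened) for you.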
You would agree that a flowchart is much more interpretable than lines of code for understanding the flow of a system. With a workflow orchestrator, the entire workflow of any project can be converted into a DAG (directed acyclic graph). The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed, with an interactive dashboard that makes lots of information (task logs, task history, etc.) quickly accessible.
Let’s say you are making a pizza:
And you want to do it every day for thousands of customers. How do you do it?
You automate it!
You may choose to break it into multiple steps. Here's a DAG for making pizza.
- Similarly, we have DAGs for real-world use cases: for example, here we are trying to collect data from multiple sources, aggregate it, transform it, and get insights/build a model.
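The pizza analogy can be written down as a tiny dependency graph and run step by step. This is a toy runner, not Airflow itself, and the step names are illustrative; note that dough, sauce, and cheese have no dependencies on each other, so a real orchestrator could run them in parallel:

```python
# A pizza-making workflow as a dependency graph.
pizza_dag = {
    "make_dough": [],
    "prepare_sauce": [],
    "grate_cheese": [],
    "assemble": ["make_dough", "prepare_sauce", "grate_cheese"],
    "bake": ["assemble"],
    "serve": ["bake"],
}

def run(dag):
    """Run each step once all of its dependencies have finished."""
    done, log = set(), []
    while len(done) < len(dag):
        for step, deps in dag.items():
            if step not in done and all(d in done for d in deps):
                log.append(step)  # a real orchestrator would execute here
                done.add(step)
    return log

print(run(pizza_dag))  # "serve" always comes last
```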
Why Apache Airflow?
Most importantly, the Airflow scheduler executes tasks on an array of workers while following the specified dependencies, and rich command-line utilities make performing complex surgeries on DAGs a snap.
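The idea of fanning independent tasks out to workers can be illustrated in a few lines of standard-library Python. This is a toy sketch of the concept, not Airflow's actual executor, and the source names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_source(name: str) -> str:
    # Stand-in for real extraction work (DB query, API call, ...).
    return f"data from {name}"

# These three tasks have no dependencies on each other,
# so a scheduler is free to hand them to separate workers.
sources = ["postgres", "s3", "api"]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_source, sources))

print(results)  # ['data from postgres', 'data from s3', 'data from api']
```

Airflow makes the same decision at the DAG level: any tasks whose upstream dependencies are all complete are eligible to run concurrently.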
Advantages of Airflow
In a nutshell, Airflow helps automate scripts in order to perform tasks. It's Python-based, but it can execute programs written in any language. For instance, if the first stage of your workflow has to execute a C++-based program to perform image analysis and then a Python-based program to transfer that information to S3, Airflow is just the tool for you. Its graphical interface is easy to use, and it has amazing community support.
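Stripped of the orchestrator, that two-stage workflow might look like the sketch below. Here `echo` stands in for the compiled C++ analyzer, the upload is a stub rather than a real boto3 call, and the bucket name is made up; in an actual DAG these two functions would typically become a BashOperator and a PythonOperator task:

```python
import subprocess

def run_image_analysis(image_path: str) -> str:
    # Stage 1: invoke an external binary. 'echo' is a placeholder for
    # the compiled C++ image-analysis program.
    result = subprocess.run(
        ["echo", f"analyzed:{image_path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def upload_to_s3(payload: str) -> str:
    # Stage 2: stub for an S3 upload; returns the would-be object URI.
    return f"s3://example-bucket/{payload}"

print(upload_to_s3(run_image_analysis("cat.png")))
```

Airflow's value is wrapping each stage with retries, logging, and alerting instead of leaving them as one fragile script.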
Who should/can attend?
The talk is for anyone who loves to automate things, or who works with data all day and would rather not get into the nitty-gritty of manually doing the same task over and over again. Apache Airflow is the perfect orchestrator tool to revamp your DevOps game, and it is a great tool for production as well. So if you are planning to move your experiments to production, this is just the talk for you! People familiar with Python and the basics of ML will find the talk interesting, while beginners will be encouraged to try out the tool.
I'll begin by introducing Apache Airflow - a platform to programmatically author, schedule, and monitor workflows. During the course of the talk, I'll demonstrate how Airflow can be used as a workflow orchestrator for designing easy-to-use ETL pipelines.
- You'll learn how to automate your queries, Python code, or Jupyter notebooks through different operators.
- Airflow provides a monitoring and managing interface, where it is possible to get a quick overview of the status of the different tasks, as well as to trigger and clear tasks or DAG runs.
- You'll get to know about the various open-source contributions possible for Apache Airflow. The project has a very active Slack channel and great community support.
- Python Fundamentals
- Concept of DAGs
- Basics of ETL pipelines (Extraction, Transformation, and Loading of data)
The slide deck (basic outline) is also briefly covered in the preview video.
I am a Data Engineer at Couture.ai, India. I have been working with Python for over 3 years now, and I am a big-time machine learning aficionado. In the past few years I have worked on Computer Vision and Music Analysis related projects. While I am not working I build AI- and ML-based applications for social good, and at work I focus on building applications at scale. I love participating in hackathons. I am a multiple-time hackathon winner (Microsoft AI hackathon, Sabre Hack, Amex AI hackathon, Icertis Blockchain and AIML hackathon, Mercedes Benz Digital Challenge), and people often call me "The Hackathon Girl". As a tech enthusiast, I enjoy sharing my knowledge and work with the community. I am a tech speaker (Pyconf Hyd 2019), tech blogger, podcast host (https://bit.ly/2EMo5sh), hackathon mentor at MLH hacks, technical content creator at Omdena, and Global Ambassador at the Women.Tech Network. I believe in hacking my way through life one bit at a time.