Bridging the Silos: Building a No Framework/Framework (NFF) using Apache Airflow

rachit25



Description:

We at AI Palette operate in the FMCG space, and ingesting data from a plethora of sources is the core entry point to our engineering systems. In the age of Big Data and rapid technological advancement, startups in growth mode increasingly strive to leverage their data assets and improve their processing capabilities.

A significant part of this endeavour lies in efficiently managing complex, long-running jobs that have historically been built in silos.

Imagine a framework that transcends these silos, centralises processes, and significantly enhances concurrency and scalability: a truly transformative tool for businesses. Over the course of 2+ months, our team built a solution to address this problem. We present our robust orchestration framework built on Apache Airflow that promises just that, and more. The end-to-end setup of this framework, including the Airflow instance itself, was done in-house, following a migration initiative in which we moved away completely from Amazon Managed Workflows for Apache Airflow (MWAA) on AWS.

  • This framework is a giant leap for us in saving time and improving efficiency, accuracy, traceability, observability and monitoring, all made possible by putting concurrency, parallelism, optimisation and distributed computing at its core.

  • It is designed to orchestrate over 70 data pipelines (DAG runs) simultaneously, across more than a dozen processes that were once built in silos and now reside completely within the framework.

  • The framework can also scale out to run over two dozen workers in parallel, increasing the number of tasks that can run concurrently by a further 5x. We presently use Celery executors; work is in progress to evaluate blue/green deployments for better load management, along with Kubernetes Celery Executors.
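As a rough illustration of how worker count translates into concurrent tasks under a Celery executor, the sketch below computes an upper bound from three values. The function name and the numbers are illustrative assumptions, not our production configuration; `parallelism` and `worker_concurrency` refer to the Airflow settings of the same names in the `[core]` and `[celery]` config sections.

```python
def max_concurrent_tasks(workers: int, worker_concurrency: int, parallelism: int) -> int:
    """Upper bound on task instances running at once (hypothetical helper).

    workers            -- number of Celery worker processes/nodes
    worker_concurrency -- task slots per worker ([celery] worker_concurrency)
    parallelism        -- scheduler-wide cap ([core] parallelism)
    """
    if workers < 0 or worker_concurrency < 0 or parallelism < 0:
        raise ValueError("all arguments must be non-negative")
    # Workers offer workers * worker_concurrency slots in total, but the
    # scheduler-wide parallelism setting caps how many can be used at once.
    return min(parallelism, workers * worker_concurrency)


# Illustrative values only: 24 workers with 16 slots each.
capped = max_concurrent_tasks(workers=24, worker_concurrency=16, parallelism=32)
raised = max_concurrent_tasks(workers=24, worker_concurrency=16, parallelism=512)
```

With these example values, adding workers alone does not help until the scheduler-wide cap is raised as well. Pools and per-DAG limits can lower the effective number further, so this is only an upper bound.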

In an era where data volumes are skyrocketing and workloads are becoming increasingly complex, the ability to scale both horizontally and vertically is no longer a luxury, but a necessity. Our framework is designed with this fundamental truth at its core. It effortlessly scales to accommodate expanding workloads and resources, thereby providing a truly scalable solution to handle the growing needs of any business.

The idea is to have a set of plugins that we keep adding to as the need arises, without tying those plugins to any one framework. As the saying goes, “If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail.” This is the concept of the no framework/framework, aka NFF.
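One way to picture the plugin idea is a small registry of plain Python callables that know nothing about the orchestrator running them. This is a minimal sketch under that assumption, not the framework's actual code; `register`, `run_plugin` and the `dedupe` plugin are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: plugin name -> plain callable with no framework imports.
_PLUGINS: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a callable to the registry under `name`."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        if name in _PLUGINS:
            raise ValueError(f"plugin already registered: {name!r}")
        _PLUGINS[name] = func
        return func
    return decorator


def run_plugin(name: str, **kwargs: Any) -> Any:
    """Look up a plugin by name and call it with keyword arguments."""
    if name not in _PLUGINS:
        raise KeyError(f"unknown plugin: {name!r}")
    return _PLUGINS[name](**kwargs)


@register("dedupe")
def dedupe(records: List[Any]) -> List[Any]:
    """Example plugin: drop duplicate records while keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


result = run_plugin("dedupe", records=[1, 2, 2, 3, 1])
```

Because each plugin is an ordinary function, it could be handed to Airflow's `PythonOperator` through `python_callable`, and equally be called from a script or a unit test without Airflow installed.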

We are excited to share our journey, insights, and the lessons we have learned along the way with the Python community at PyCon India.

Prerequisites:

  1. Basic understanding of Apache Airflow and its architecture.
  2. Basic understanding of Kubernetes and Docker.
  3. Basic understanding of multi-processing and multi-threading in Python.
  4. Basic understanding of distributed systems.

Video URL:

https://www.youtube.com/watch?v=lVR5A-qdoBs

Content URLs:

The slides will be pinned here in some time. In the interim, here is a breakdown of the contents we will cover in the presentation:

Breaking Down the Silos : The Why Behind our Framework

Harnessing the Airflow : The Birth of Our Framework

Conquering Concurrency : Workers Unleashed

Scaling Heights : Venturing Beyond the Horizon 

Nailing Failures : An Art of Error Handling 

The Symphony of Pipelines : Orchestrating Data Flows

A Sneak Peek into the Future : Roadmap and Improvements

Stoking the Flames : A Call to Action

Speaker Info:

I am Rachit Mishra, currently working as a Lead Data Engineer at AI Palette. We are a product-based company operating in the FMCG space, in rapid growth mode and working through a lot of interesting problems as we scale our tech.

Over the last 5 years, I have worked in the healthcare, fintech and food tech spaces, primarily in data engineering.

"How do we optimally model, engineer and analyse data to maximise the quality and richness of the insights delivered to customers?" is the question I have been obsessed with answering throughout my career so far.

Speaker Links:

Connect with me on LinkedIn here.

Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate