Building Data Confidence: Unleashing the power of Python, Airflow, and Great Expectations (GX) for standardizing and streamlining robust data validation

SOURAV ROY (~sourav4)


Description:

Abstract

In today's data-driven world, ensuring data quality and building data confidence is paramount for making informed business decisions. However, data validation can be a complex and time-consuming task, especially when multiple data sources are involved, and it often leads to errors and inconsistencies in data analysis. This talk addresses the need for data confidence and the importance of identifying data problems at an early stage, before deployment, where they otherwise go unnoticed until they start causing issues. We will explore how the powerful combination of Python, Airflow, and Great Expectations (popularly known as the GX library) can be leveraged to build a robust, fully cloud-native, automated data validation pipeline, enabling standardisation and ease of access to data validation.

Introduction (3 mins)

We will start by discussing the criticality of data confidence in decision-making processes, explore the repercussions of relying on inaccurate or inconsistent data, and make the case for a proactive approach that identifies and rectifies data problems before deployment.

GX Dive (5 mins)

Next, we will delve into Great Expectations, a powerful open-source library that plays a pivotal role in building the data validation pipeline. We will highlight its key benefits, including the ability to define, manage, and automate data expectations, as well as its capability to perform comprehensive data profiling and validation. Through code examples such as the sketch below, we will demonstrate how Great Expectations enables data practitioners to build a robust data validation engine, ensuring data quality and consistency across the entire data lifecycle.
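
As a taste of what the demo will cover, here is a minimal sketch using the GX fluent API (0.16+); the file name orders.csv, the column names, and the value bounds are illustrative assumptions rather than the talk's actual dataset:

    import great_expectations as gx

    # Minimal sketch (GX fluent API, 0.16+): file and column names below
    # are illustrative assumptions, not the talk's actual dataset.
    context = gx.get_context()

    # Read a batch of data and get a Validator to attach expectations to.
    validator = context.sources.pandas_default.read_csv("orders.csv")

    # Each call checks the current batch and records the expectation
    # in the suite.
    validator.expect_column_values_to_not_be_null(column="order_id")
    validator.expect_column_values_to_be_between(
        column="amount", min_value=0, max_value=100000
    )

    # Persist the suite, then validate the whole batch in one pass.
    validator.save_expectation_suite(discard_failed_expectations=False)
    results = validator.validate()
    print(results.success)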

Integration of the Python GX data validation engine into a data pipeline using Airflow (10 mins)

Building upon the foundation of Great Expectations, we will explore how Airflow, a popular workflow management platform, can be used to manage the data validation pipeline. We will showcase how Airflow allows for the creation of automated, scalable, and orchestrated workflows, so that validation runs seamlessly every time a new set of data arrives. Attendees will gain insights into how Airflow can be leveraged to schedule, monitor, and manage the data validation pipeline efficiently, thereby reducing manual effort and increasing overall productivity; a sketch of such a DAG follows. We will also demonstrate rendering the results of data validation on a statically hosted site, along with alerting mechanisms for end users.
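
To give a flavour of this integration, here is a minimal DAG sketch using the community GreatExpectationsOperator from the airflow-provider-great-expectations package; the DAG id, path, and checkpoint name are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from great_expectations_provider.operators.great_expectations import (
        GreatExpectationsOperator,
    )

    # Hypothetical project location: adapt to your deployment layout.
    GX_ROOT = "/opt/airflow/include/great_expectations"

    with DAG(
        dag_id="daily_data_validation",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",  # validate each new day's data
        catchup=False,
    ) as dag:
        validate_orders = GreatExpectationsOperator(
            task_id="validate_orders",
            data_context_root_dir=GX_ROOT,
            checkpoint_name="orders_checkpoint",   # defined in the GX project
            fail_task_on_validation_failure=True,  # halt the pipeline on bad data
        )

Validation results can then be rendered as static HTML via GX Data Docs (for example with context.build_data_docs()) and published to any static hosting service, while alerting can hook into the DAG's failure callbacks.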

Benefits of Standardisation and Ease of Access to Data Validation (5 mins)

One of the primary objectives of the data validation pipeline is to establish standardisation. We will discuss the importance of establishing common data expectations and how this promotes consistency across different data sources and projects. Additionally, we will highlight how the pipeline facilitates collaboration among data practitioners by providing a unified platform for sharing, reusing, and iterating upon data validation rules and tests.
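
As a small illustration of that sharing, a suite saved by one team lives as JSON in the project's expectation store and can be fetched, inspected, and reused by anyone else; the suite name orders_contract below is a hypothetical example:

    import great_expectations as gx

    context = gx.get_context()

    # "orders_contract" is a hypothetical shared suite, stored as JSON
    # under great_expectations/expectations/ and versioned with the code.
    suite = context.get_expectation_suite("orders_contract")

    # Any project can inspect, reuse, or extend the same validation rules.
    for expectation in suite.expectations:
        print(expectation.expectation_type, expectation.kwargs)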

Conclusion (2 mins)

In conclusion, this talk will highlight the need for data confidence, early identification of data problems, and the role of Great Expectations and Airflow in building a robust data validation pipeline. Attendees will leave with a clear understanding of how to leverage Python, Airflow, and Great Expectations to establish a standardised, automated, and integrated approach to data validation, ultimately enabling more reliable and trustworthy data-driven decision-making and avoiding surprises once the data is in the production environment.

Q&A (5 mins)

Prerequisites:

  • Knowledge of object-oriented programming in Python (>= 3.7)
  • Knowledge of workflow scheduling/orchestration using Airflow (optional)

Speaker Info:

A Python professional with 10 years of industry experience developing scalable Python projects using multiple backend frameworks. Passionate about data and its transformation across the various stages of the data pipeline and data lifecycle. A core Python enthusiast, currently working as a Senior Python Developer with Seneca Global IT Services, Hyderabad.

Speaker Links:

LinkedIn: Sourav Roy

Section: Others
Type: Talks
Target Audience: Intermediate