Data Collection using Airflow

Priyanshu Sinha (~priyanshu7)


Description:

At AI Palette, we help clients in the FMCG space across geographies identify upcoming trends in the market. To do that, we need to understand what is being offered to the customer and how people perceive the products that are available. Addressing these challenges requires data from both the supply side (e-commerce and food platforms) and the demand side (social platforms and search results). That's where the major task of collecting and transforming this data to derive insights comes into the picture.

To get good-quality data, we need to answer the following questions:

  1. Which platforms do we target to get the intended data?
  2. Can we crawl data from the identified platforms, and what restrictions, if any, apply?
  3. How do we crawl the data from each website and store all the relevant information?
  4. How do we enrich the collected data to derive insights?
  5. How do we tie all the above steps together in an orchestration framework?
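The second question can often be answered programmatically before any crawling begins: a platform's robots.txt declares which paths a crawler may fetch. A minimal sketch using Python's standard-library parser (the robots.txt body, user agent, and URLs below are illustrative, not a real platform's policy):

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and report whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A hypothetical robots.txt: everything allowed except /private/
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(can_crawl(ROBOTS, "example-bot", "https://example.com/products"))   # True
print(can_crawl(ROBOTS, "example-bot", "https://example.com/private/x"))  # False
```

In practice the robots.txt body would be fetched from the platform itself, and rate limits or terms of service may impose restrictions that robots.txt does not capture.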

The workshop will focus on the high-level design of the data collection process, along with the orchestration framework built on Apache Airflow that enables robustness, scalability, and high performance.

Prerequisites:

  1. Basic understanding of Apache Airflow and its architecture.
  2. Familiarity with Python libraries used for web scraping, such as Requests, BeautifulSoup, and Selenium.
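As a refresher on the second prerequisite, here is a minimal BeautifulSoup sketch that extracts product listings from a static HTML snippet. The markup and class names are made up for illustration; a real page would first be fetched with Requests or rendered with Selenium:

```python
from bs4 import BeautifulSoup

# Illustrative HTML, standing in for a fetched product-listing page.
HTML = """
<ul class="products">
  <li><span class="name">Matcha Latte</span><span class="price">$4.50</span></li>
  <li><span class="name">Oat Milk</span><span class="price">$3.20</span></li>
</ul>
"""

def parse_products(html: str) -> list[dict]:
    """Extract product name and price from each listing item."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "name": li.select_one(".name").get_text(),
            "price": li.select_one(".price").get_text(),
        }
        for li in soup.select("ul.products li")
    ]

print(parse_products(HTML))
```

CSS selectors like these tend to be brittle against site redesigns, which is one reason crawlers benefit from the retry and alerting machinery an orchestrator provides.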

Video URL:

https://drive.google.com/file/d/1DJoRepcKJMALFsUGXYkLqOvkrNDTw-wk/view?usp=drive_link

Content URLs:

The slides and video will be updated shortly. Here is a breakdown of the content we will cover in the presentation at the conference:

  1. Data Sources and Destinations: We'll explore the platforms we target to gather data and where we store the insights.
  2. Architecture Overview and ETL Process: We'll delve into the high-level design of our data collection process.
  3. Scalability and Performance: The ability to scale and perform efficiently is critical in handling vast amounts of data. We'll share our strategies for achieving this.
  4. Scheduling and Orchestration: How do we ensure the process runs smoothly? We'll reveal how we use Apache Airflow to manage our workflows effectively.
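To give a taste of the scheduling and orchestration portion, here is a minimal Airflow DAG skeleton using the TaskFlow API (Airflow 2.4+). The DAG id, task bodies, schedule, and retry settings are placeholders for illustration, not our production configuration:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(
    dag_id="data_collection",
    schedule="@daily",                       # run the pipeline once a day
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 3,                        # robustness: retry flaky crawls
        "retry_delay": timedelta(minutes=5),
    },
)
def data_collection():
    @task
    def crawl() -> list[dict]:
        # Fetch raw records from the target platform (placeholder).
        return [{"name": "example product"}]

    @task
    def enrich(records: list[dict]) -> list[dict]:
        # Derive insights from the raw records (placeholder).
        return [{**r, "enriched": True} for r in records]

    @task
    def store(records: list[dict]) -> None:
        # Persist the enriched records to the destination (placeholder).
        print(f"storing {len(records)} records")

    store(enrich(crawl()))

data_collection()
```

Chaining the decorated tasks gives Airflow the dependency graph (crawl → enrich → store), and the scheduler handles retries, backfills, and parallelism across DAG runs.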

Speaker Info:

Priyanshu and Purna currently work at AI Palette as a Data Engineer and a Full Stack Engineer, respectively. They have been working to bridge the gap between raw data and actionable insights, enabling businesses to make data-driven decisions with confidence.

Speaker Links:

Priyanshu Sinha : LinkedIn

Purna Chandra Reddy : LinkedIn

Section: Developer tools and automation
Type: Talks
Target Audience: Beginner
Last Updated: