Set up a scalable data processing pipeline
Raj Bharath Kannan (~raj_bharath)
Objective: Build a scalable data processing pipeline using PySpark. This talk gives the bigger picture of a data pipeline setup: its components, deployment, and integration. We hope that sharing the challenges we faced during setup and while processing 2 TB of data, and how we overcame them, will help others.
- Data storage: AWS S3
- Processing unit: EMR with PySpark
- Orchestration and scheduling: Apache Airflow
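As a conceptual sketch of how these components connect (the real pipeline uses Airflow to schedule PySpark jobs on EMR; the function bodies and S3 path below are hypothetical placeholders, not the actual implementation), the flow boils down to three dependent stages run in order:

```python
# Conceptual sketch: extract from S3 -> transform on EMR -> load back to S3.
# In Airflow this ordering would be expressed as task dependencies
# (e.g. extract >> transform >> load); here plain functions stand in.

def extract(s3_path: str) -> list:
    """Stand-in for reading raw data from S3 (normally spark.read on EMR)."""
    return ["record-from-" + s3_path]

def transform(records: list) -> list:
    """Stand-in for the PySpark transformations run on the EMR cluster."""
    return [r.upper() for r in records]

def load(records: list) -> int:
    """Stand-in for writing results back to S3; returns the record count."""
    return len(records)

def run_pipeline(s3_path: str) -> int:
    # Each stage consumes the previous stage's output.
    return load(transform(extract(s3_path)))

print(run_pipeline("s3://my-bucket/raw/"))  # hypothetical bucket path
```

In the real setup, Airflow replaces `run_pipeline` with a DAG whose tasks submit steps to the EMR cluster on a schedule.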
Outline of the talk:
- Objective & Requirements
- Data pipeline and its components
- Orchestration & Scheduling
- Testing the scalability of the system using a few open data use cases
- Challenges and solutions
- Lessons learnt from our production data engineering experience
Prerequisites: Knowledge of AWS services and a programming background
- Raj Bharath Kannan: 8 years of programming experience across various languages and platforms, including data engineering, mobile and web application development, and infrastructure management.
- Sayan Biswas: 4 years of experience building big data analytics platforms; experienced with big data technology stacks (Kafka, Schema Registry, Hadoop, Elasticsearch, Spark).