Set up a scalable data processing pipeline

Raj Bharath Kannan (~raj_bharath)

Description:

Objective: Build a scalable data processing pipeline using PySpark. This talk will give the bigger picture of a data pipeline setup: its components, deployment, and integration. We hope that sharing the challenges we faced during setup, the issues we hit while processing 2 TB of data, and how we overcame them will help others.

Tools/Services:

  • Data will be stored in AWS S3
  • Processing will run on Amazon EMR with PySpark (see the sketch below)
  • Orchestration and scheduling will use Apache Airflow (a DAG sketch follows the talk outline)
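
To make the processing step concrete, here is a minimal sketch of the kind of PySpark job that would run on EMR against data in S3. The bucket paths, column names (event_time, event_type), and the aggregation itself are illustrative placeholders, not the actual pipeline covered in the talk:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Entry point; on EMR this script is typically launched via spark-submit.
    spark = SparkSession.builder.appName("s3-pipeline-sketch").getOrCreate()

    # Read raw data from S3 (placeholder bucket and layout).
    events = spark.read.parquet("s3://example-raw-bucket/events/")

    # A representative transformation: daily counts per event type
    # (event_time and event_type are hypothetical columns).
    daily_counts = (
        events.withColumn("day", F.to_date("event_time"))
        .groupBy("day", "event_type")
        .agg(F.count("*").alias("events"))
    )

    # Write processed output back to S3, partitioned for downstream reads.
    (
        daily_counts.write.mode("overwrite")
        .partitionBy("day")
        .parquet("s3://example-processed-bucket/daily_counts/")
    )

    spark.stop()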

Outline of the talk:

  • Objective & Requirements
  • Data pipeline and its components
  • Orchestration & Scheduling
  • Testing the scalability of the system using a few open data use cases
  • Challenges and solutions
  • Lessons learnt from our production experience in data engineering
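
To illustrate the orchestration and scheduling layer, below is a minimal sketch of an Airflow DAG that submits a PySpark step to an already running EMR cluster and waits for it to finish. It assumes a recent Airflow with the Amazon provider package installed; the cluster id, job script path, and schedule are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # Placeholder Spark step; on EMR, command-runner.jar wraps spark-submit.
    SPARK_STEP = [
        {
            "Name": "daily-processing",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-code-bucket/job.py"],
            },
        }
    ]

    with DAG(
        dag_id="s3_emr_pipeline_sketch",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # placeholder schedule
        catchup=False,
    ) as dag:
        # Submit the Spark step to a long-running EMR cluster (placeholder id).
        add_step = EmrAddStepsOperator(
            task_id="add_spark_step",
            job_flow_id="j-EXAMPLECLUSTER",
            steps=SPARK_STEP,
        )

        # Block until the submitted step completes or fails.
        watch_step = EmrStepSensor(
            task_id="watch_spark_step",
            job_flow_id="j-EXAMPLECLUSTER",
            step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
        )

        add_step >> watch_step

Splitting submission (operator) from completion-watching (sensor) is a common pattern: Airflow can retry the wait without resubmitting the Spark step.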

Prerequisites:

Knowledge of AWS services and a programming background

Content URLs:

https://docs.google.com/presentation/d/18tvTtYsDGKBXjHfr6Fl0FbRPo_PWGFtd2oGD12iLS4A/edit?usp=sharing

Speaker Info:

  1. Raj Bharath Kannan: 8 years of programming experience across various languages and platforms, including data engineering, mobile and web application development, and infrastructure management.
  2. Sayan Biswas: 4 years of experience in building big data analytics platforms. Experienced in working with big data technology stacks (Kafka, Schema Registry, Hadoop, Elasticsearch, Spark).

Speaker Links:

Raj Bharath:

https://www.linkedin.com/in/rajbharath/

https://github.com/rajbharath/

Sayan:

https://www.linkedin.com/in/sayan-biswas-867220136/

Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: