Set up a scalable data processing pipeline

Raj Bharath Kannan (~raj_bharath)


Description:

Objective: Build a scalable data processing pipeline using PySpark.

Tools/Services:

  • Data will be stored in AWS S3
  • Processing will run on AWS EMR with PySpark (a minimal job sketch follows this list)
  • Orchestration and scheduling will be handled by Apache Airflow
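
To make the processing piece concrete, below is a minimal PySpark job sketch of the kind such a pipeline would run on EMR; the bucket names, input schema, and aggregation logic are hypothetical placeholders rather than the talk's actual production job:

    # process.py - minimal PySpark job sketch; paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def main():
        # On EMR, the SparkSession picks up cluster configuration automatically.
        spark = SparkSession.builder.appName("s3-events-aggregation").getOrCreate()

        # Read raw JSON events from S3 (EMR resolves s3:// paths via EMRFS).
        events = spark.read.json("s3://example-raw-bucket/events/")

        # Example transformation: daily event counts per event type.
        daily_counts = (
            events
            .withColumn("event_date", F.to_date("timestamp"))
            .groupBy("event_date", "event_type")
            .count()
        )

        # Write results back to S3, partitioned by date for downstream consumers.
        (daily_counts.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet("s3://example-processed-bucket/daily_counts/"))

        spark.stop()

    if __name__ == "__main__":
        main()

The same script runs unchanged whether the EMR cluster has two nodes or two hundred, which is what lets the pipeline scale with data volume.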

Outline of the talk:

  • Objective & Requirements
  • Data pipeline and its components
  • Orchestration & Scheduling with Apache Airflow (see the DAG sketch after this outline)
  • Testing the scalability of the system with a few open-data use cases
  • Challenges and solutions
  • Lessons learnt from our production data engineering experience
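
As a sketch of the orchestration layer, the hypothetical DAG below submits the PySpark job above as a step on an EMR cluster and waits for it to finish. The cluster id, connection id, and S3 paths are placeholders, and the operator import paths assume a recent apache-airflow-providers-amazon release (older Airflow versions ship these operators under different module paths):

    # dag_emr_pipeline.py - sketched Airflow DAG; ids, paths and schedule are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # A single EMR step that spark-submits the job script stored in S3.
    SPARK_STEP = [{
        "Name": "daily-events-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://example-artifacts-bucket/jobs/process.py",
            ],
        },
    }]

    with DAG(
        dag_id="s3_emr_pipeline",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the PySpark job as a step on an already-running EMR cluster.
        add_step = EmrAddStepsOperator(
            task_id="add_spark_step",
            job_flow_id="j-EXAMPLECLUSTERID",  # hypothetical cluster id
            aws_conn_id="aws_default",
            steps=SPARK_STEP,
        )

        # Wait for the step to complete; the step id is pulled from XCom,
        # where EmrAddStepsOperator pushes the ids of the submitted steps.
        watch_step = EmrStepSensor(
            task_id="watch_spark_step",
            job_flow_id="j-EXAMPLECLUSTERID",
            step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
            aws_conn_id="aws_default",
        )

        add_step >> watch_step

Separating submission from monitoring means the DAG run fails if the Spark step fails, and downstream tasks only fire once the processed data in S3 is actually ready.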

Prerequisites:

Working knowledge of AWS services and a general programming background.

Content URLs:

https://docs.google.com/presentation/d/18tvTtYsDGKBXjHfr6Fl0FbRPo_PWGFtd2oGD12iLS4A/edit?usp=sharing

Speaker Info:

  1. Raj Bharath Kannan has 8 years of programming experience across various languages and platforms, including data engineering, mobile and web application development, and infrastructure management.
  2. Sayan Biswas has 4 years of experience building big data analytics platforms, working with big data technology stacks such as Kafka, Schema Registry, Hadoop, Elasticsearch, and Spark.

Speaker Links:

Raj Bharath:

https://www.linkedin.com/in/rajbharath/

https://github.com/rajbharath/

Sayan:

https://www.linkedin.com/in/sayan-biswas-867220136/

Id: 1347
Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: