Set up a scalable data processing pipeline

Raj Bharath Kannan (~raj_bharath)


Objective: Build a scalable data processing pipeline using PySpark.


  • Data will be stored in AWS S3
  • Processing will run on EMR with PySpark
  • Orchestration and scheduling with Apache Airflow
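As a rough illustration of the processing unit above, here is a minimal sketch of a PySpark job that an EMR step might run: read raw events from S3, aggregate, and write curated output back to S3. The bucket, paths, and column names are hypothetical placeholders, not from the proposal; the PySpark import is kept inside `main()` so the file can be read without Spark installed.

```python
def daily_event_counts(df):
    """Pure transformation: aggregate raw events into per-day counts.
    Keeping the logic in a function like this makes it easy to unit-test
    against a local SparkSession before deploying to EMR."""
    from pyspark.sql import functions as F
    return (df
            .withColumn("day", F.to_date("timestamp"))  # hypothetical column
            .groupBy("day")
            .count())


def main():
    # Imported here so the module loads even where Spark is not installed;
    # in a deployed job this would sit at module level.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-event-aggregation")
             .getOrCreate())

    # EMR ships with the S3 connector, so s3:// paths work directly.
    events = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path
    (daily_event_counts(events)
     .write.mode("overwrite")
     .parquet("s3://example-bucket/curated/daily_counts/"))  # hypothetical path
    spark.stop()

# main() would be invoked via spark-submit as an EMR step, e.g.:
#   spark-submit aggregate.py
```

Separating the transformation from the Spark session wiring is a common pattern for keeping such jobs testable.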

Outline of the talk:

  • Objective & Requirements
  • Data pipeline and its components
  • Orchestration & Scheduling
  • Testing the scalability of the system using a few open data use cases
  • Challenges and solutions
  • Lessons learnt from our production experience in data engineering
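The "Orchestration & Scheduling" item above could take a shape like the following hedged sketch: an Airflow DAG that submits a PySpark step to a running EMR cluster and waits for it to finish. The operator names come from the `apache-airflow-providers-amazon` package, but exact parameters vary by provider version, and the cluster id, bucket, and script path are hypothetical placeholders. Imports are kept inside the factory so the file can be read without Airflow installed.

```python
def build_dag():
    # In a deployed DAG file these imports would sit at module level.
    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # An EMR "step" wrapping spark-submit via command-runner.jar.
    spark_step = [{
        "Name": "daily-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
        },
    }]

    with DAG(
        dag_id="s3_emr_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # `schedule=` in newer Airflow releases
        catchup=False,
    ) as dag:
        add_step = EmrAddStepsOperator(
            task_id="submit_spark_step",
            job_flow_id="j-XXXXXXXXXXXX",  # hypothetical running-cluster id
            steps=spark_step,
        )
        # EmrAddStepsOperator pushes the new step ids to XCom; the sensor
        # polls the first one until it completes or fails.
        wait = EmrStepSensor(
            task_id="wait_for_step",
            job_flow_id="j-XXXXXXXXXXXX",
            step_id="{{ task_instance.xcom_pull('submit_spark_step')[0] }}",
        )
        add_step >> wait
    return dag
```

Airflow's scheduler then handles retries, backfills, and the daily cadence without any custom cron glue.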


Prerequisites: Knowledge of AWS services and a programming background.

Content URLs:

Speaker Info:

  1. Raj Bharath Kannan has 8 years of programming experience across various languages and platforms, including data engineering, mobile and web application development, and infrastructure management.
  2. Sayan Biswas has 4 years of experience building big data analytics platforms, and has worked with big data technology stacks (Kafka, Schema Registry, Hadoop, Elasticsearch, Spark).

Speaker Links:

Raj Bharath:


Id: 1347
Section: Developer tools and automation
Type: Talks
Target Audience: Intermediate
Last Updated: