Setting up a scalable data processing pipeline
Raj Bharath Kannan (~raj_bharath)
Objective: Build a scalable data processing pipeline using PySpark.
- Data will live in AWS S3
- Processing will run on AWS EMR with PySpark
- Orchestration and scheduling with Apache Airflow
Outline of the talk:
- Objective & Requirements
- Data pipeline and its components
- Orchestration & Scheduling
- Testing the scalability of the system using a few open data use cases
- Challenges and solutions
- Lessons learnt from our production experience in data engineering
Prerequisites: knowledge of AWS services and a programming background
- Raj Bharath Kannan: 8 years of programming experience across languages and platforms, including data engineering, mobile and web application development, and infrastructure management
- Sayan Biswas: 4 years of experience building big data analytics platforms, with hands-on experience across big data technology stacks (Kafka, Schema Registry, Hadoop, Elasticsearch, Spark)