Set up a scalable data processing pipeline
Raj Bharath Kannan (~raj_bharath)
Objective: Build a scalable data processing pipeline using PySpark. This talk gives the bigger picture of a data pipeline setup: its components, deployment, and integration. We hope that sharing the challenges we faced during setup and while processing 2 TB of data, and how we overcame them, will help others.
- Data storage: AWS S3
- Processing unit: EMR with PySpark
- Orchestration and scheduling: Apache Airflow
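As a conceptual sketch of how these components connect (the real pipeline uses Airflow to schedule PySpark jobs on EMR; the function bodies and S3 path below are hypothetical placeholders, not the actual implementation), the flow boils down to three dependent stages run in order:

```python
# Conceptual sketch: extract from S3 -> transform on EMR -> load back to S3.
# In Airflow this ordering would be expressed as task dependencies
# (e.g. extract >> transform >> load); here plain functions stand in.

def extract(s3_path: str) -> list:
    """Stand-in for reading raw data from S3 (normally spark.read on EMR)."""
    return ["record-from-" + s3_path]

def transform(records: list) -> list:
    """Stand-in for the PySpark transformations run on the EMR cluster."""
    return [r.upper() for r in records]

def load(records: list) -> int:
    """Stand-in for writing results back to S3; returns the record count."""
    return len(records)

def run_pipeline(s3_path: str) -> int:
    # Each stage consumes the previous stage's output.
    return load(transform(extract(s3_path)))

print(run_pipeline("s3://my-bucket/raw/"))  # hypothetical bucket path
```

In the real setup, Airflow replaces `run_pipeline` with a DAG whose tasks submit steps to the EMR cluster on a schedule.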
Outline of the talk:
- Objective & Requirements
- Data pipeline and its components
- Orchestration & Scheduling
- Testing the scalability of the system using a few open data use cases
- Challenges and solutions
- Lessons learnt from our production data engineering experience
Prerequisites: Knowledge of AWS services and a programming background
- Raj Bharath Kannan: 8 years of programming experience across various languages and platforms, including data engineering, mobile and web application development, and infrastructure management.
- Sayan Biswas: 4 years of experience building big data analytics platforms; experienced with big data technology stacks (Kafka, Schema Registry, Hadoop, Elasticsearch, Spark).