Architecting data products at scale with Python and AWS Serverless
Atish Kathpal (~atish7)
Description:
I intend to cover the following topics in my talk:
- What/Why/How of Data Engineering
- Evolving landscape: Evolution of data platforms over the last couple of decades (Data warehouses -> Data lakes -> Lakehouse architectures)
- Bird's-eye view of the data engineering tech stack (Data sources, storage, processing, Data consumers, Data products and BI)
- Workloads: Data engineering workloads and access patterns (OLTP vs OLAP, I/O patterns)
- Storage: File and storage formats in data engineering
- Data Processing: Python + Spark => PySpark. Transforming raw data into organized datasets that power data science, machine learning and AI use cases.
- Data cataloging: Pydantic, Pandas, AWS Glue Data Catalog
- Practical insights and end-to-end architecture walkthrough (Medallion Architecture)
- Role of Python
- Common pitfalls
- AWS serverless and Python as a Swiss Army knife for data engineers, with the what and why of each tool: Pydantic, PySpark on EMR Serverless, S3, DynamoDB, AWS Lambda, EventBridge, Athena, QuickSight
- Metrics from the production system we built at KnowBe4 on the same architectural principles, and how it scales seamlessly to 80 million users and millions of cybersecurity events per day.
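As a rough illustration of the medallion flow listed above (raw "bronze" records, validated "silver" records, aggregated "gold" tables), here is a minimal sketch in plain Python using only the standard library. It is not the KnowBe4 implementation: the event fields, sample data, and function names are hypothetical, and a real pipeline along the lines of this talk would more likely use Pydantic for validation and PySpark for the transformations.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical raw events as they might land, unvalidated, in a bronze layer.
BRONZE = [
    {"user_id": "u1", "event_type": "phish_click", "ts": "2024-01-01T10:00:00"},
    {"user_id": "u2", "event_type": "phish_report", "ts": "2024-01-01T10:05:00"},
    {"user_id": "", "event_type": "phish_click", "ts": "2024-01-01T10:06:00"},  # empty user_id
    {"user_id": "u1", "event_type": "phish_click"},  # missing timestamp
]


@dataclass(frozen=True)
class SilverEvent:
    """A cleaned event with all required fields present (assumed schema)."""
    user_id: str
    event_type: str
    ts: str


def to_silver(raw: dict) -> Optional[SilverEvent]:
    """Validate one bronze record; return None if a required field is missing or empty."""
    if not raw.get("user_id") or not raw.get("event_type") or not raw.get("ts"):
        return None
    return SilverEvent(raw["user_id"], raw["event_type"], raw["ts"])


def build_silver(bronze: list) -> list:
    """Bronze -> silver: keep only the records that pass validation."""
    return [event for event in map(to_silver, bronze) if event is not None]


def build_gold(silver: list) -> dict:
    """Silver -> gold: aggregate cleaned events into per-type counts for reporting."""
    return dict(Counter(event.event_type for event in silver))


silver = build_silver(BRONZE)
gold = build_gold(silver)
print(gold)
```

Each layer only reads from the one before it, so a rejected bronze record never reaches the gold aggregates, and any layer can be rebuilt from its upstream data.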
Prerequisites:
Undergraduate-level knowledge of Computer Science and 1-2 years of working experience with Python. A keen interest in building distributed backend systems designed for scale and resiliency.
Content URLs:
Speaker Info:
I lead the Data Engineering and AI Production Systems teams at KnowBe4.
I am passionate about building distributed system stacks and data products that solve real-world problems. I hold a Bachelor’s in Computer Science from BITS Pilani and have over a decade of experience providing technical leadership and architecting and developing large-scale distributed systems. My past work includes papers published at conferences such as USENIX, along with 7 patents on software innovations.
Speaker Links:
Past talks, blogs and presentations: