Unified Data Processing with Apache Beam

Raj Rakesh (~raj84)


Description:

Currently, some popular data processing frameworks such as Apache Spark treat batch and stream processing jobs differently (Spark and Spark Streaming). The APIs of different processing systems, such as Apache Spark and Apache Flink, also differ. This forces the end user to learn a potentially new system every time.

Apache Beam addresses this problem by providing a unified programming model that can be used for both batch and streaming pipelines. The Beam SDK allows the user to execute these pipelines on different execution engines (for example, a Spark cluster or Google Cloud Dataflow).
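To give a feel for what "unified" means here, the following is a minimal plain-Python sketch, not Beam code: a single transform is written once and applied to both a bounded (batch) input and an unbounded (streaming) input. The names `extract_words`, `batch_source`, and `stream_source` are hypothetical and exist only for this illustration; Beam itself expresses the same idea through its own pipeline and transform abstractions.

```python
import itertools


def extract_words(lines):
    """Transform written once: lazily split each line into lowercase words."""
    for line in lines:
        for word in line.lower().split():
            yield word


# Bounded ("batch") input: a finite, in-memory list of lines.
batch_source = ["Beam is unified", "Beam is portable"]


def stream_source():
    """Hypothetical unbounded ("streaming") input that never ends."""
    for i in itertools.count():
        yield f"event {i}"


# The same transform handles both kinds of input.
batch_words = list(extract_words(batch_source))

# An unbounded source cannot be fully materialized, so take only the
# first four outputs for demonstration purposes.
stream_words = list(itertools.islice(extract_words(stream_source()), 4))
```

Because the transform only relies on iteration, it does not need to know whether its input ever ends; this is the property the unified model aims to give to whole pipelines.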

In this talk, we start with an overview of Apache Beam using the Python SDK and the problems it tries to address from an end user’s perspective. We cover the core programming constructs in the Beam model, such as PCollections, ParDo, GroupByKey, windowing, and triggers. We describe how these constructs make it possible for pipelines to be executed in a unified fashion in both batch and streaming modes. We then use examples to demonstrate these capabilities. The examples showcase Beam for stream processing and real-time data analysis, and show how Beam can be used for feature engineering in some Machine Learning applications using TensorFlow.
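As a rough illustration of how windowing and grouping by key fit together, here is a plain-Python sketch, not the Beam API. It assigns timestamped events to fixed 60-second windows and groups values by window and key, which is conceptually close to applying `beam.WindowInto(beam.window.FixedWindows(60))` followed by `beam.GroupByKey()` in the Python SDK. The helper names and the sensor readings are hypothetical, and triggers are not modeled.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def window_and_group(events, size):
    """Group (key, value, timestamp) events by (window_start, key)."""
    grouped = defaultdict(list)
    for key, value, timestamp in events:
        grouped[(fixed_window_start(timestamp, size), key)].append(value)
    return dict(grouped)


# Hypothetical sensor readings: (key, value, event time in seconds).
events = [
    ("sensor-a", 3, 5),
    ("sensor-b", 7, 12),
    ("sensor-a", 4, 58),
    ("sensor-a", 1, 61),
]

# Sum the grouped values per (window_start, key) pair.
per_window_sums = {
    window_key: sum(values)
    for window_key, values in window_and_group(events, 60).items()
}
```

The reading at second 61 lands in the window starting at 60, so it is summed separately from the earlier `sensor-a` readings. Grouping by event time rather than arrival time is what lets the same logic apply whether the events come from a file or from a live stream.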

Outline

  1. Batch and Stream Data Processing
  2. Problems with Different Data Processing Pipelines
  3. Unified & Portable Approach
  4. Apache Beam Model – Why so good?
     • Features
     • Execution Engines
     • Windowing
  5. Examples
  6. Q/A

Target Audience:

  • Data Engineering Developers
  • Big Data Developers
  • Data Science Professionals

Prerequisites:

Preferable

  • Exposure to writing data pipelines with Python
  • Basic data engineering with Python

Content URLs:

  • Presentation to be used for the session.
    • The contents of the deck will mostly be drawn from the official Apache Beam documentation.
    • For the code base related to the Apache Beam examples, follow the GitHub repository.
    • Intro video and teaser.

Speaker Info:

Raj is a Solution Architect - IoT Cloud Platforms at Hitachi Consulting with more than 7 years of industry experience in Data Engineering and Data Science. He holds 4 Google Cloud Professional certifications and is passionate about data. He has worked extensively in Data Engineering across different Big Data processing frameworks and now works on public clouds. His favorite language to code in is Python, alongside Go, for all his Data Engineering and Data Science work.

Speaker Links:

Blogs

Profile

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Beginner
Last Updated: