Unified and Portable Parallel Data Processing using Apache Beam

mukul arora (~mukul11) | 22 Jun, 2018

Description:

Data characterized by the three Vs (volume, variety and velocity) is labelled Big Data. Big Data and parallel processing have been hot topics ever since Google’s paper on MapReduce, right through to today's era of execution engines such as Apache Spark and Google Cloud Dataflow.

Apache Beam is a unified programming model for big data processing. It lets the user define batch and streaming data processing jobs once and run them on multiple execution engines (runners) such as Apache Spark, Apache Flink and Google Cloud Dataflow.

*Objective of the talk*:

Overview of Apache Beam Python SDK
Core SDK constructs like Pipeline, PTransform, PCollection etc.
Creating custom DoFns and composite Transforms
Creating a Pipeline with customizable options
Running a pipeline on different runners like DirectRunner, DataflowRunner etc.
Unit testing a Pipeline with asserts
Demo: StreamingWordCount example using Google Cloud Dataflow
Q&A

Prerequisites:

A little knowledge about Python 2.7
Enthusiasm for Parallel Data Processing
Motivation to play with lots of Data

Content URLs:

Apache Beam: https://beam.apache.org/
Apache Beam Python SDK: https://beam.apache.org/documentation/sdks/pydoc/2.4.0

Speaker Info:

I am Mukul Arora, a Software Engineer at the Schlumberger India Technology Centre. I graduated from Delhi Technological University in May 2017. I am a Data Science and Big Data practitioner and have been closely involved in solving Computer Vision and Medical Imaging problems using Deep Learning techniques. Currently, I am exploring efficient ways to solve Big Data problems in the cloud. I am an avid cricket fan and love to write poems.

Speaker Links:

LinkedIn: https://www.linkedin.com/in/mukularoradce/

Github: https://github.com/codemukul95

YourQuote: https://www.yourquote.in/mukul-arora-ffds/quotes/

Section:	Others
Type:	Talks
Target Audience:	Intermediate
Last Updated:	22 Jun, 2018
