Unified and Portable Parallel Data Processing using Apache Beam
Mukul Arora (~mukul11)
Data characterized by the three Vs (volume, variety and velocity) is labelled Big Data. Big Data and parallel processing have been hot topics ever since Google's MapReduce paper, through to today's era of execution engines such as Apache Spark and Google Cloud Dataflow.
Apache Beam is a unified big data processing model that lets users run batch and streaming data processing jobs on multiple execution engines such as Apache Spark, Apache Flink and Google Cloud Dataflow.
*Objective of the talk*:
- Overview of Apache Beam Python SDK
- Core SDK constructs like Pipeline, PTransform, PCollection etc.
- Creating custom DoFns and composite Transforms
- Creating a Pipeline with customizable options
- Running a pipeline on different runners like DirectRunner, DataflowRunner, etc.
- Unit testing a Pipeline with asserts
- Demo: StreamingWordCount example using Google Cloud Dataflow
*Prerequisites*:
- Basic knowledge of Python 2.7
- Enthusiasm for Parallel Data Processing
- Motivation to play with lots of Data
*Resources*:
- Apache Beam: https://beam.apache.org/
- Apache Beam Python SDK: https://beam.apache.org/documentation/sdks/pydoc/2.4.0
I am Mukul Arora, a Software Engineer at the Schlumberger India Technology Centre. I graduated from Delhi Technological University in May 2017. I am a Data Science and Big Data practitioner and have been deeply involved in solving Computer Vision and Medical Imaging problems using Deep Learning techniques. Currently, I am exploring efficient ways to solve Big Data problems on the Cloud. I am an avid cricket fan and love writing poems.