
Python + Spark: Lightning Fast Cluster Computing

by Jyotiska NK (speaking)

Section
Infrastructure
Technical level
Intermediate

Objective

To give an overview of and lead a discussion on Apache Spark, a fast cluster computing framework for large-scale data processing; to discuss use cases; and to demonstrate the advantages of Spark over Hadoop using programs written with the PySpark API for analyzing large datasets.

Description

Apache Spark is an in-memory cluster computing framework, originally developed at UC Berkeley. Spark runs standalone, on Amazon EC2, or on Apache Mesos (a cluster manager), which is why it can co-locate with Hadoop and be deployed in an existing Hadoop cluster. Along with local data, Spark can process data stored in HDFS, HBase, Hive, and Cassandra. Spark is designed to handle both general data processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.

Results have shown that Spark can be 10 to 100 times faster than Hadoop on datasets scaling up to terabytes. Currently, it supports Scala, Java, and Python for writing programs.

In this talk, I will cover the following Spark concepts: a general overview; RDDs (Resilient Distributed Datasets), the read-only, partitioned collections of records through which Spark achieves memory abstraction, fault tolerance, and fast in-memory computation; job scheduling; and memory management.
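As a small taste of the RDD abstraction, here is a minimal sketch; the local master setting, app name, and numbers are placeholders chosen for illustration, not material from the talk itself:

    from pyspark import SparkContext

    # Start a local SparkContext; "local[2]" uses two worker threads on one machine.
    sc = SparkContext("local[2]", "rdd-demo")

    # Build an RDD from an in-memory collection; it is partitioned across the workers.
    numbers = sc.parallelize(range(1, 1001))

    # Transformations (filter, map) are lazy; nothing executes until an action runs.
    squares_of_evens = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

    # Actions (count, reduce) trigger the actual distributed computation.
    print(squares_of_evens.count())
    print(squares_of_evens.reduce(lambda a, b: a + b))

    sc.stop()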

I will give a demonstration of PySpark, the Python API for Spark, and show different operations (map, reduce, sort, filter) on a large dataset. I will then do a head-to-head comparison between two programs doing the same work: one written in mrjob for Hadoop and the other written using PySpark.
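To give a flavour of that comparison, here is a hedged sketch of a word count written both ways; the class name and input path are illustrative and not the exact programs from the demo. First, the mrjob version, which runs as a Hadoop Streaming job:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        # Mapper: emit (word, 1) for every word on every input line.
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # Reducer: sum the counts for each word.
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

And a PySpark version of the same job, expressed as chained RDD transformations:

    from pyspark import SparkContext

    sc = SparkContext("local", "wordcount")
    counts = (sc.textFile("input.txt")                  # illustrative input path
                .flatMap(lambda line: line.split())     # split lines into words
                .map(lambda word: (word, 1))            # pair each word with a count
                .reduceByKey(lambda a, b: a + b))       # sum counts per word
    print(counts.take(10))
    sc.stop()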

I will conclude the presentation by talking about companies currently using Spark worldwide and their use cases.

Requirements

If you would like to try out the example programs along with the demo, you will need:

A moderately powerful laptop (at least 4 GB of RAM)

A Linux machine

Python 2.7

A pre-compiled local Spark installation (follow http://spark.incubator.apache.org/docs/latest/quick-start.html)

I will share the datasets used in the demo along with the slides.

Speaker bio

I work as a Data Engineer at DataWeave, a Big Data startup based in Bangalore working in the retail and e-commerce domain. I finished my Master's in Data Science at IIIT-Bangalore this year.

I am a committer on the Apache Spark project. My main contributions are to PySpark: improving the performance of the Python API and building programming guides and documentation for PySpark to encourage Python programmers to use Spark.
