Distributed Machine Learning in Python Using Apache Spark

Satwik Kansal (~satwikkansal)


Description:

The workshop covers the core concepts of Apache Spark by implementing machine learning algorithms from scratch in PySpark so that they run in a distributed fashion. We'll implement algorithms and techniques such as OLS regression, K-fold cross-validation, and PageRank to understand, hands-on, the concepts involved in building a distributed machine learning pipeline.

Rough outline:

10 mins

  • What is distributed computing?
  • Model parallelism and data parallelism

10 mins

  • Map-reduce programming paradigm
    • What is it?
    • A simple example
    • When is it effective?
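
To make the map-reduce idea concrete, here is a minimal sketch in plain Python (not Spark). It assumes the input is already split into chunks, as if each chunk lived on a different worker; the sample sentences and the names `map_phase` and `reduce_phase` are made up for illustration.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map: turn one chunk of text into partial word counts."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Reduce: merge two partial counts (addition is associative and commutative)."""
    return left + right


# Hypothetical input, pre-split into chunks as a stand-in for distributed data.
chunks = [
    "spark makes big data simple",
    "big data needs big clusters",
    "simple ideas scale",
]

partial_counts = map(map_phase, chunks)           # each chunk is processed independently
word_counts = reduce(reduce_phase, partial_counts, Counter())
print(word_counts["big"], word_counts["data"])
```

Because each chunk is mapped independently and the merge order does not matter, the same computation can be spread over many machines, which is roughly when the paradigm is effective.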

10 mins

  • Apache Spark
    • Some interesting history
    • Spark Architecture (Driver, Spark Context, Cluster Manager, Executors, and tasks)

30 mins

  • PySpark (Python Language API for Spark)
    • PySpark hello world (word count using map reduce)
    • Hands-on walkthrough of concepts like
      • RDDs and lazy execution
      • Execution phases (Jobs, stages, and tasks)
    • Must-know Spark operations (transformations and actions)
    • Getting familiar with the Spark dashboard to visualize even the tiniest execution details

50 mins

  • Machine learning with PySpark
    • Writing distributed linear regression (from scratch)
      • Data preparation
      • Exploratory data analysis (EDA)
      • Defining baseline loss
      • OLS loss
      • Gradient descent, finally!
      • Training and visualizing
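
As a rough idea of how gradient descent on the OLS loss can be distributed, the sketch below uses plain Python lists as stand-ins for data partitions. Each partition computes a partial gradient (the map step) and the partial results are summed (the reduce step). The toy data, the learning rate, and the function names are assumptions for illustration, not the workshop's actual code.

```python
from functools import reduce


def partition_gradient(partition, w, b):
    """Map step: sum the squared-error gradients over one partition."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        residual = (w * x + b) - y
        grad_w += 2.0 * residual * x
        grad_b += 2.0 * residual
    return grad_w, grad_b, len(partition)


def combine(a, c):
    """Reduce step: element-wise sum of two partial results."""
    return a[0] + c[0], a[1] + c[1], a[2] + c[2]


def fit(partitions, learning_rate=0.05, epochs=2000):
    """Batch gradient descent for y = w * x + b on mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        partials = [partition_gradient(p, w, b) for p in partitions]
        grad_w, grad_b, n = reduce(combine, partials)
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical toy data generated from y = 2x + 1, split across two "workers".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
w, b = fit(partitions)
print(round(w, 3), round(b, 3))
```

Only the small tuple of partial sums travels between workers and the driver, never the raw rows, which is what makes this formulation a natural fit for data parallelism.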

20 mins

  • A bit more complexity: K-fold cross-validation
    • Framing complex solutions in MapReduce terms
    • Caching, broadcasting, and some lesser-used but useful constructs
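
One way to frame K-fold cross-validation in map-style terms is to tag every record with a fold id and treat each fold as an independent task. The sketch below is plain Python with a deliberately trivial baseline model (predict the training mean); the data and the helper names `assign_folds` and `evaluate_fold` are hypothetical.

```python
def assign_folds(records, k):
    """Tag each record with a fold id, round-robin by position."""
    return [(i % k, rec) for i, rec in enumerate(records)]


def evaluate_fold(fold_id, tagged):
    """One independent task: fit on the other k-1 folds, score on the held-out fold."""
    train = [y for f, y in tagged if f != fold_id]
    held_out = [y for f, y in tagged if f == fold_id]
    prediction = sum(train) / len(train)  # baseline model: the training mean
    return sum((y - prediction) ** 2 for y in held_out) / len(held_out)


# Hypothetical target values; a real pipeline would carry features as well.
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = 3
tagged = assign_folds(targets, k)

# Map over fold ids: the folds do not depend on each other, so they can run in parallel.
fold_errors = list(map(lambda fold_id: evaluate_fold(fold_id, tagged), range(k)))
cv_error = sum(fold_errors) / k
print(fold_errors, cv_error)
```

Since every fold re-reads the same tagged dataset, this is also the kind of access pattern where keeping the data cached is likely to pay off.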

20 mins

  • Scaling up!
    • Spark ML and MLlib
    • Cloud environments for PySpark (from providers like Amazon, Microsoft, and Google)

Prerequisites:

Attendees should be familiar with Python. As for logistics, it would be great if you already have PySpark installed on your system.

Speaker Info:

I'm a software developer experienced in data science and decentralized applications.

My profile page link

Speaker Links:

Scaling, distributed computation, and machine learning are my core areas of interest, and this session is a blend of all three. My previous public work in these areas includes:

My open-source contributions and projects related to Python, data science, and decentralized/distributed apps can be found on my GitHub.

Id: 1182
Section: Data Science, Machine Learning and AI
Type: Workshop
Target Audience: Intermediate
Last Updated: