Distributed Machine Learning in Python using Apache Spark

Satwik Kansal (~satwikkansal)


The workshop covers the core concepts of Apache Spark by implementing machine learning algorithms from scratch in PySpark so that they can run in a distributed fashion. We'll implement algorithms and techniques such as OLS regression, K-fold cross-validation, and PageRank to build a practical understanding of the concepts involved in creating a distributed machine learning pipeline.

Rough outline

10 mins

  • What is distributed computing?
  • Model parallelism and data parallelism

10 mins

  • MapReduce programming paradigm
    • What is it?
    • A simple example
    • When is it effective?
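
As a rough illustration of the paradigm, here is a minimal single-machine sketch of word count split into map, shuffle, and reduce phases. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are made up for this example; a real framework would run the map and reduce phases on many machines and handle the shuffle itself.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values of one key with an associative operation (here: sum)."""
    return reduce(lambda a, b: a + b, values)


def word_count(lines):
    # Map every line independently, then group by word, then reduce per word.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {word: reduce_phase(ones) for word, ones in shuffle_phase(pairs).items()}


counts = word_count(["spark is fast", "spark is distributed"])
print(counts)
```

The pattern tends to pay off when each record can be mapped independently and the reduce operation is associative, so partial results can be combined in any order.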

10 mins

  • Apache Spark
    • Some interesting history
    • Spark Architecture (Driver, Spark Context, Cluster Manager, Executors, and tasks)

30 mins

  • PySpark (Python Language API for Spark)
    • PySpark hello world (word count using MapReduce)
    • Practically explaining concepts like
      • RDDs and lazy execution
      • Execution phases (Jobs, stages, and tasks)
    • Must-know Spark operations (transformations and actions)
    • Getting familiar with the Spark dashboard to inspect even the tiniest execution details
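
The word count "hello world" and the idea of lazy execution could look roughly like the sketch below. It assumes a `SparkContext` named `sc` is passed in by the caller; `tokenize` and `build_word_counts` are illustrative names, not part of PySpark. The function only chains transformations, so no job runs until an action such as `collect()` is called.

```python
from operator import add


def tokenize(line):
    """Split one line into lowercase words."""
    return line.lower().split()


def build_word_counts(sc, lines):
    """Return an RDD of (word, count) pairs; nothing is computed yet."""
    return (
        sc.parallelize(lines)          # distribute the input as an RDD
        .flatMap(tokenize)             # transformation: lines -> words
        .map(lambda word: (word, 1))   # transformation: word -> (word, 1)
        .reduceByKey(add)              # transformation: sum the counts per word
    )


# Usage, assuming a local PySpark installation:
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "wordcount")
#   rdd = build_word_counts(sc, ["spark is fast", "spark is lazy"])
#   print(rdd.collect())   # collect() is the action that triggers a job
#   sc.stop()
```

Running the commented usage and then opening the Spark dashboard should show the job triggered by `collect()`, split into stages around the `reduceByKey` shuffle.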

50 mins

  • Machine learning with PySpark
    • Writing distributed linear regression (from scratch)
      • Data preparation
      • EDA
      • Defining baseline loss
      • OLS loss
      • Gradient descent, finally!
      • Training and visualizing
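
To give a flavour of the from-scratch approach, here is a minimal local sketch of gradient descent on the OLS (mean squared error) loss, written in map and reduce terms. All names are illustrative, the toy data is invented for the example, and the workshop's actual implementation may differ. With an RDD, the same step would presumably be expressed with `rdd.map(...)` followed by `rdd.reduce(add_vectors)`.

```python
from functools import reduce


def point_gradient(weights, features, target):
    """Gradient of the squared error (prediction - target)^2 for one record."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [2.0 * error * x for x in features]


def add_vectors(a, b):
    """Associative, commutative combiner used in the reduce step."""
    return [x + y for x, y in zip(a, b)]


def gradient_step(records, weights, learning_rate):
    """One gradient descent update on the mean squared error."""
    # Map: one gradient per record. Reduce: element-wise sum of those gradients.
    total = reduce(add_vectors, (point_gradient(weights, x, y) for x, y in records))
    n = len(records)
    return [w - learning_rate * g / n for w, g in zip(weights, total)]


def train(records, n_features, learning_rate=0.1, iterations=500):
    """Run a fixed number of gradient steps starting from all-zero weights."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        weights = gradient_step(records, weights, learning_rate)
    return weights


# Toy data generated from y = 1 + 2x; the leading 1.0 is the bias feature.
data = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
weights = train(data, n_features=2)
print([round(w, 3) for w in weights])
```

On this toy data the weights should end up close to the intercept 1 and slope 2 used to generate it. Only the small weight vector and the summed gradient travel between steps, which is what makes the computation data-parallel.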

20 mins

  • A bit more complexity: K-fold cross-validation
    • Framing complex solutions in MapReduce terms
    • Caching, broadcasting, and some lesser-used but useful constructs
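
One way to frame K-fold cross-validation in these terms is to tag every record with a fold number and then filter per fold. The sketch below does this on plain Python lists; the helper names and the toy "mean" model are assumptions made for illustration. In PySpark, the indexing could plausibly come from `rdd.zipWithIndex()`, the splits from `rdd.filter(...)`, with `rdd.cache()` on the indexed data since it is reused for every fold, and `sc.broadcast(...)` for small shared values such as model weights.

```python
def assign_fold(index, k):
    """Deterministically map a record index to one of k folds."""
    return index % k


def split_for_fold(indexed_records, fold, k):
    """Return (train, validation) lists for one fold."""
    train = [rec for i, rec in indexed_records if assign_fold(i, k) != fold]
    validation = [rec for i, rec in indexed_records if assign_fold(i, k) == fold]
    return train, validation


def cross_validate(records, k, fit, score):
    """Average the validation score over k folds."""
    indexed = list(enumerate(records))  # reused by every fold, hence worth caching
    scores = []
    for fold in range(k):
        train, validation = split_for_fold(indexed, fold, k)
        model = fit(train)
        scores.append(score(model, validation))
    return sum(scores) / k


# Toy usage: the "model" is just the mean of the training targets.
def fit_mean(train):
    return sum(train) / len(train)


def mean_squared_error(model, validation):
    return sum((y - model) ** 2 for y in validation) / len(validation)


cv_score = cross_validate(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3, fit=fit_mean, score=mean_squared_error
)
print(cv_score)
```

Assigning folds by index modulo `k` keeps the split deterministic; a shuffled or hashed assignment would be a reasonable alternative when the input is ordered.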

20 mins

  • Scaling up!
    • Spark ML and MLlib
    • Cloud environments for PySpark (from providers like Amazon, Microsoft, and Google)


Attendees should be familiar with Python. As for logistics, it would be great if you have PySpark already installed on your system.

