Distributed Machine Learning in Python Using Apache Spark

Satwik Kansal (~satwikkansal)


The workshop covers the core concepts of Apache Spark by implementing machine learning algorithms from scratch in PySpark so that they can run in a distributed fashion. We'll implement algorithms and techniques such as OLS regression, K-fold cross-validation, and PageRank to build a practical understanding of the concepts involved in creating a distributed machine learning pipeline.

Rough outline

10 mins

  • What is distributed computing?
  • Model parallelism and data parallelism

10 mins

  • MapReduce programming paradigm
    • What is it?
    • A simple example
    • When is it effective?
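
As a rough illustration of the "simple example" item above, word count can be sketched in plain Python as a map step that emits (word, 1) pairs, a shuffle that groups them by word, and a reduce step that sums each group. This is a single-process stand-in for the paradigm, not Spark code; the function names and the sample lines are made up for the illustration.

```python
from collections import defaultdict
from functools import reduce

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values with an associative sum."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

# Hypothetical sample input, standing in for lines of a text file.
sample_lines = ["spark makes map reduce simple", "map then reduce"]
counts = word_count(sample_lines)
```

Because the sum is associative and commutative, the reduce step could run on separate machines and be merged afterwards, which is the property that makes the pattern effective. In PySpark the same shape is usually written with `flatMap`, `map`, and `reduceByKey` on an RDD.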

10 mins

  • Apache Spark
    • Some interesting history
    • Spark architecture (driver, SparkContext, cluster manager, executors, and tasks)

30 mins

  • PySpark (Python Language API for Spark)
    • PySpark hello world (word count using MapReduce)
    • Hands-on explanation of concepts such as
      • RDDs and lazy execution
      • Execution phases (Jobs, stages, and tasks)
    • Must-know Spark operations (transformations and actions)
    • Getting familiar with the Spark dashboard to inspect even the tiniest execution details

50 mins

  • Machine learning with PySpark
    • Writing distributed linear regression (from scratch)
      • Data preparation
      • Exploratory data analysis (EDA)
      • Defining baseline loss
      • OLS loss
      • Gradient descent, finally!
      • Training and visualizing
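
The regression items above can be sketched in a data-parallel way: each partition computes its share of the squared-error gradient (the map step), the partial results are summed (the reduce step), and the driver updates the weights. The following is a minimal plain-Python sketch under stated assumptions, not the workshop's actual code: it uses a single feature, a made-up toy dataset, and the hypothetical names `partial_gradient`, `combine`, and `fit`; the nested lists stand in for RDD partitions.

```python
from functools import reduce

def partial_gradient(partition, w, b):
    """Map step: sum the squared-error gradient over one partition."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return gw, gb, len(partition)

def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    return tuple(p + q for p, q in zip(a, b))

def fit(partitions, lr=0.05, epochs=2000):
    """Driver loop: aggregate partial gradients, then update w and b."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw, gb, n = reduce(
            combine, (partial_gradient(p, w, b) for p in partitions)
        )
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy data lying exactly on y = 2x + 1, split into two "partitions".
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = fit(partitions)
```

Only the small `(gw, gb, n)` tuples travel between the partitions and the driver, never the raw records, which is why this formulation scales with the size of the data.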

20 mins

  • A bit more complexity: K-fold cross-validation
    • Framing complex solutions in MapReduce terms
    • Caching, broadcasting, and some lesser-used but useful constructs
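
One possible way to frame K-fold cross-validation in MapReduce terms is to tag every record with a fold id (a map), then, for each fold, reduce the remaining records into a model and reduce the held-out records into a score. The sketch below is plain Python and makes several assumptions for brevity: the "model" is just the mean of the training targets (a baseline predictor), folds are assigned by index modulo k, and the names `assign_fold`, `fold_mse`, and `k_fold_cv` are hypothetical.

```python
from functools import reduce

def assign_fold(indexed_record, k):
    """Map: tag each (index, y) record with a fold id."""
    index, y = indexed_record
    return index % k, y

def fold_mse(tagged, fold):
    """Fit a mean-only baseline on the other folds, score it on `fold`."""
    # Reduce the training records to (sum, count), then take the mean.
    total, count = reduce(
        lambda acc, y: (acc[0] + y, acc[1] + 1),
        (y for f, y in tagged if f != fold),
        (0.0, 0),
    )
    mean = total / count
    # Reduce the held-out records to (squared error sum, count).
    sq_err, n = reduce(
        lambda acc, y: (acc[0] + (y - mean) ** 2, acc[1] + 1),
        (y for f, y in tagged if f == fold),
        (0.0, 0),
    )
    return sq_err / n

def k_fold_cv(targets, k):
    """Average the held-out mean squared error over all k folds."""
    tagged = [assign_fold(rec, k) for rec in enumerate(targets)]
    scores = [fold_mse(tagged, fold) for fold in range(k)]
    return sum(scores) / k

# Made-up targets for illustration only.
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_score = k_fold_cv(targets, k=3)
```

The tagged dataset is read once per fold, so in Spark it is a natural candidate for caching, and a small fitted model (here just `mean`) is the kind of value one might broadcast to the executors.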

20 mins

  • Scaling up!
    • Spark ML and MLlib
    • Cloud environments for PySpark (from providers like Amazon, Microsoft, and Google)


Attendees should be familiar with Python. As for logistics, it would be great if you already have PySpark installed on your system.

Speaker Info:

I'm a software developer experienced in data science and decentralized applications.

My profile page link

Speaker Links:

Scaling, distributed computation, and machine learning are my core areas of interest, and this session is a blend of all three. My previous public work in these areas includes:

My open source contributions and projects related to Python, data science, and decentralized/distributed apps can be found on my GitHub.

Id: 1182
Section: Data Science, Machine Learning and AI
Type: Workshop
Target Audience: Intermediate
Last Updated: