Advanced Data Science Practices using PySpark

Chinukapoor kapooor (~chinukapoor)


The advent of Apache Spark has more or less replaced the MapReduce paradigm for solving data mining problems. Spark provides its own Python libraries (PySpark) that let applications benefit from in-memory computation. In this workshop, we will go through the basics of Apache Spark, discuss the architecture that makes it attractive to data scientists, and run a hands-on session on using PySpark to solve some advanced data mining problems.

We will also take an off-the-shelf dataset, perform data imputation and basic EDA, and then move on to predictive/statistical modeling (K-means, Mixture of Gaussians, etc.) using PySpark.

Total Duration: 100 mins

The specific topics that will be covered are:

  1. Introduction to the PySpark API (why SparkSession is the easiest entry point to all of Spark's functionality) - [15 mins]
  2. RDDs vs DataFrames (when to use RDDs and when to use DataFrames; yes, we still use RDDs) - [20 mins]
  3. Spark ML vs Spark MLlib: let's clear up the confusion (trust me, this will bug you while writing programs) - [15 mins]
  4. Basic and advanced functions of PySpark (a. guidelines for when to use what; b. why reduceByKey is preferable to groupByKey; c. whether you should use Spark SQL extensively) - [20 mins]
  5. A run-down of a sample program that uses all of the above concepts - [15 mins]
  6. A sneak preview of the Spark MongoDB connectors - [15 mins]

Teaching Platform: Jupyter Notebooks

Learning Outcomes:

  1. You will probably prefer Spark DataFrames over Pandas for your next data analysis problem
  2. Understand how the data pipeline really works end-to-end
  3. Understand the nuances of Spark and improve on general practices for writing Spark code


Prerequisites:

  1. Basic understanding of Python
  2. Basic concepts of data mining, data imputation, and data cleaning
  3. Very basic understanding of some well-known machine learning algorithms
  4. Procedural programming

Content URLs:

Tableau Public Viz:

IEEE Research Paper on IoT and Image Processing:

Award-winning Poster on Image Processing and IoT presented at the 2nd Agricultural and Climate Change Conference in Barcelona, Spain:

Speaker Info:

  1. Graduate of Business Analytics from the Indian School of Business, Hyderabad
  2. Passionate about Python and solving business problems.
  3. Represented India at the 2nd Agricultural and Climate Change Conference in Barcelona, Spain.
  4. Author of 3 research papers using the concepts of Image Processing, IoT and Machine Learning in the field of Agriculture.
  5. 3+ years of experience in core Data Analytics, mostly helping telecom clients across Africa.

Speaker Links:

Ayush Kapoor has presented research papers at various international and national conferences at institutes such as RVCE, Bangalore and RNSIT, Bangalore.

Id: 1228
Section: Data Science, Machine Learning and AI
Type: Workshop
Target Audience: Advanced
Last Updated: