PySpark: Combining Machine Learning & Big Data

Ayon Roy (~ayon)




With the ever-increasing flow of data, the industry is focused on how to use that data to drive business decisions and insights. But what about the sheer size of the data we have to deal with these days?

The cleaner your data, the better it is for training your ML (Machine Learning) models. Sadly, the world does not feed you clean data, and common libraries like Pandas cannot process such huge volumes of data quickly.

How about using big data libraries that support Python to handle this huge amount of data and derive business insights with ML techniques? But how can we combine the two?

Here comes “PySpark: Combining Machine Learning & Big Data”.

People in the ML domain usually prefer Python, so combining Big Data technologies like Spark with ML is easy with pyspark (a Python package for using Spark’s capabilities).

This talk will revolve around:

1) Why do we need to fuse Big Data with Machine Learning?

2) How will Spark’s architecture help us prepare for faster ML?

3) How does pyspark’s MLlib (Machine Learning Library) help you do ML so seamlessly?

Outline of the talk

  • What are Big Data & Machine Learning? : 3 minutes
  • Why do we need to fuse Big Data with Machine Learning? : 2 minutes
  • What is Spark & how will its architecture help us in doing ML? : 10 minutes
  • How do we harness the power of Spark using Python? (Here comes PySpark) : 5 minutes
  • How will Spark’s ML library help us do ML seamlessly using PySpark? : 10 minutes


Preferable:

  1. A high-level overview of Big Data environments like Spark
  2. A basic understanding of Machine Learning terminology like regression

Video URL:

Content URLs:

Find the supporting slide for this talk at

View all my previous talks/podcasts at

Speaker Info:

Ayon has a distinct passion for problem solving using Data Science, ML, and AI, and loves to wear the cap of a tech speaker.

He has had multiple stints in Data Science, AI, and ML through various internships and actively contributes to society by mentoring hackathons and bootcamps. To date, he has delivered 10+ technical talks at places like the International Center for Genetic Engineering & Bio-Technology and has also mentored 15+ hackathons, open source initiatives, etc.

Alongside this, he is also the organizer of India's First Kaggle Days Meetup in Delhi NCR and loves to review technical books too.

Speaker Links:

Personal Website



Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: