Introduction to PySpark
Shashijeevan M.P. (~shashijeevan)
Apache Spark is an open-source distributed computation framework. It sits on top of a cluster manager and distributed storage. A Spark program runs in a driver and uses the cluster manager to schedule tasks across the cluster.
Apache Spark has become a preferred option in the field of Machine Learning due to its fast processing, which uses in-memory computation with Resilient Distributed Datasets (RDDs). With Python being the most popular language for Machine Learning and Deep Learning tasks, PySpark has become an important weapon in the arsenal of Data Scientists and Data Engineers.
PySpark is the Python API to the Scala core of Spark, allowing Python programmers to run distributed jobs on Spark.
This session will introduce you to the Spark architecture and show how to use PySpark to run Machine Learning tasks on Spark.
Knowledge of Machine Learning
Knowledge of Python
Links will be provided soon.
Shashi Jeevan is an author, trainer, and architect with over two decades of experience in the software industry, working in various domains including Finance, Digital Signage, and Rich Media Management. He loves mastering new technologies and sharing what he learns. He regularly presents and organizes free technical sessions through the Hyderabad Software Architects meetup group, which he founded in 2015.