Introduction to PySpark: Using Apache Spark with Python
Jaidev Deshpande (~jaidev)
Large-scale and distributed data processing has, for the most part, not been fully available to Pythonistas (for a number of reasons). Therefore, while the Python ecosystem boasts some of the best and most sophisticated data analysis libraries, it has not been able to fully harness parallelism and distributed computing for big data problems. This is not to say that big data problems are unsolvable within Python. They certainly are solvable, and quite a few libraries have made decent attempts at solving them. But these solutions rarely come with the ease of development that makes Python so awesome. By all indications, PySpark changes that.
PySpark is the Python API for Apache Spark (http://spark.apache.org), a general purpose cluster computing framework. This tutorial is a hands-on tour of PySpark. In this tutorial, participants will be able to solve large scale data processing problems armed with only a little more than Python primitives.
The tutorial will cover:
- Basics of Spark
- Scripting and writing standalone applications for Spark with PySpark
- Using Spark's MLlib module to run machine learning jobs
- Using Spark for processing streaming data
Prerequisites:
- Basic Python data structures
- Basic knowledge of Pandas dataframes and SQL
- Knowledge of common data storage formats like JSON, delimiter-separated files, HDFS, etc.
- Entry-level machine learning (optional; a 101-level exposure will suffice)
Software requirements:
- Apache Spark (downloadable from http://spark.apache.org/downloads.html)
- A Python distribution containing IPython, Pandas and Scikit-learn (something like Enthought Canopy or Anaconda will be ideal)
Note: Pandas and scikit-learn are required only to highlight some features of Spark by comparison. The tutorial will contain a couple of examples from each of these libraries, and demonstrate how the same tasks can be performed with Spark.
Software Setup: Some additional setup is required to configure Spark for flexible usage with the standard Python installations. I will soon upload notes detailing this setup process.
IPython notebooks and slides will be uploaded soon.
I'm a data scientist at DataCulture Analytics (http://dataculture.io), where I build large-scale machine learning applications. Previously I worked at Enthought, Inc., where I was one of the developers of the Canopy data analysis platform. I have been a research assistant in the fields of machine learning and signal processing at the Tata Institute of Fundamental Research and the University of Pune. I love developing GUI apps and signal processing tools in my free time.
- GitHub: http://github.com/jaidevd
- Twitter: http://twitter.com/jaidevd
- AboutMe: http://about.me/jaidevd