Big Data Analysis using PySpark
Shagun Sodhani (~shagunsodhani)
Apache Spark™ is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, and it combines SQL, streaming, and complex analytics.
PySpark is the Python binding for Apache Spark. In the talk, I would focus on what Spark is all about and the different ways it can be used with Python: scripts, the interactive shell, and Jupyter. Then I would move on to how Spark (PySpark) can be leveraged for Big Data analytics and how it improves on other data processing engines like Apache Hadoop. For the demo, I would use the publicly available StackOverflow data to show how PySpark can draw interesting insights from StackOverflow and its sister sites. For example, did you know that there is a high positive correlation between a user's reputation and the number of upvotes they have cast?
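The reputation/upvote correlation mentioned above can be written as a map step followed by an associative reduce step, which is the shape of computation that Spark distributes across a cluster. The sketch below is a minimal illustration in plain Python, not the code from the talk: the sample pairs are made up rather than taken from the StackOverflow data, and the names `to_moments`, `combine`, and `pearson` are hypothetical.

```python
import math
from functools import reduce

# Hypothetical (reputation, upvotes_cast) pairs; made-up numbers,
# not taken from the StackOverflow data.
users = [(1, 0), (15, 2), (120, 30), (950, 210), (4000, 800)]


def to_moments(pair):
    """Map step: turn one (x, y) record into its partial sums."""
    x, y = pair
    return (1, x, y, x * x, y * y, x * y)


def combine(a, b):
    """Reduce step: add two partial-sum tuples element-wise (associative)."""
    return tuple(p + q for p, q in zip(a, b))


def pearson(moments):
    """Pearson correlation coefficient from the accumulated sums."""
    n, sx, sy, sxx, syy, sxy = moments
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


# Locally this uses the built-in map and functools.reduce.
moments = reduce(combine, map(to_moments, users))
r = pearson(moments)
print(f"Pearson correlation: {r:.3f}")
```

Because `combine` is associative, the same two functions could be handed to an RDD of such pairs as `pairs.map(to_moments).reduce(combine)`, assuming `pairs` has already been loaded. PySpark DataFrames also provide `df.stat.corr(col1, col2)` for this; the column names to pass depend on how the data was loaded.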
Attendees should be comfortable with Python.
The code and presentation (along with detailed explanation) are available here.
I am a developer working with the Analytics team at Adobe Systems. I have been actively using Spark for the past year and Python for the past 4 years. I am helping as a teaching assistant for the 3-course series titled Data Science and Engineering with Spark XSeries, created in partnership with professors from University of California, Berkeley; University of California, Los Angeles; and Databricks, and offered on the edX platform. The course is primarily focused on using PySpark for Big Data analysis. I have previously given talks on Spark at the Big Data Training Program, IIT Roorkee, and at the PyDelhi Meetup.