Analytics - Big AND Small data

AbdealiJK (~AbdealiJK)


Description:

The small-data/big-data problem:

Many people do analytics every day using Excel, SAS, R, Python/Pandas, etc. And they all run into the same big issue ... how to handle big data:

  • I can only do analysis on 1000 clients at a time with this machine
  • Ugh, I have to use that file ... it's 500 MB ... 30 minutes to open it. (time for lunch!)
  • The sample of records I ran my analytics on was not representative of the population

The primary reason for these problems is that people do not use the right tools to handle the entire dataset they have. And when they do try the Big Data tools, it generally doesn't work out: Big Data tooling is fairly complicated, needs to be learnt, and can be confusing. The biggest point of confusion, and the most difficult to overcome, is:

"I ran this calculation on 100 records, it took 5 mins in Spark, while it took 2 seconds in Pandas/R/Excel/etc. !!! This technology is horribly slow! (heh...)"

What the talk aims to explain is:

  • Why Python is great for analysts at any data size - all the way from kilobytes to petabytes
  • How to structure and write code that can run at any data size within the expected time frames (see the sketch after this list)
  • How to do analytics so that it can be done with ease on small samples and can scale to Big Data with minimal effort
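
One way to picture the second point (a sketch with hypothetical function and column names): if the analysis code restricts itself to the DataFrame operations that Pandas and Koalas share, the same function can run unchanged on an in-memory sample or on a cluster-backed frame.

    import pandas as pd

    def top_spenders(df, n=10):
        # Restricted to operations that Pandas and Koalas DataFrames share,
        # so the same function runs on a laptop sample or on a cluster
        return (df.groupby("client")["spend"]
                  .sum()
                  .sort_values(ascending=False)
                  .head(n))

    # Small data: plain Pandas
    sample = pd.DataFrame({"client": ["a", "b", "a"], "spend": [10, 5, 7]})
    print(top_spenders(sample))

    # Big Data: the same call on a Koalas DataFrame backed by Spark
    # (the parquet path is illustrative)
    # import databricks.koalas as ks
    # print(top_spenders(ks.read_parquet("clients.parquet")))

The hard part, which the talk goes into, is knowing which operations fall inside that shared subset and what to do when they don't.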

The talk focuses on doing this using some very useful features released in the past few months: Spark 2.4, ways to leverage PyArrow, some common syntax that helps with Pandas and Spark interoperability, and the recently introduced Koalas library.
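
As a taste of those features, here is a hedged sketch (column names illustrative) of two of them in Spark 2.4: Arrow-accelerated conversion between Spark and Pandas DataFrames, and a grouped-map Pandas UDF, which lets existing Pandas logic run on Spark-partitioned data.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # Use Apache Arrow for fast Spark <-> Pandas conversion
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.createDataFrame(pd.DataFrame(
        {"client": ["a", "b", "a", "b"], "spend": [10.0, 5.0, 7.0, 3.0]}
    ))

    # Plain Pandas logic, executed per group by Spark via Arrow
    @pandas_udf("client string, spend double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        pdf["spend"] = pdf["spend"] - pdf["spend"].mean()
        return pdf

    df.groupby("client").apply(demean).show()

    # ...and back to Pandas, with Arrow speeding up the transfer
    result = df.toPandas()

Koalas goes one step further and wraps this machinery behind a Pandas-like API, so most of the syntax differences disappear altogether.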

Outline of the talk:

  • The problem with using "Big Data Analytics" on small data [2 mins]
  • Why Python is ideal to solve this problem, and its role in the analytics market today [3 mins]
  • How to start thinking about writing code for ALL Data [10 mins]
  • Things that are not so easy to think about with ALL Data [10 mins]
  • The internal workings of these solutions [5 mins]

Prerequisites:

  • Basic knowledge that Python is useful for analytics
    • Awareness that a library like Pandas exists and helps with analytics on small data
    • Knowing that a Big Data buzzword called Spark exists, that it has been used by many to handle Big Data solutions, and wanting to get started with it

This talk will be useful for:

  • Beginners, to understand why Python is great for analytics
  • Intermediate users who know how to do analytics on small data with Python and want to start using Spark and Big Data tools
  • Advanced users who are creating pipelines for Big Data, have a lot of complaints about its speed on small data (which, for some reason, everyone assumes scales linearly!), and want to know how to optimize it

Speaker Info:

Abdeali Kothari - a.k.a. Ali (if talking) or @AbdealiJK (if texting) - graduated from IIT Madras and then worked at American Express, followed by Corridor Platforms, where he is architecting a decisioning platform for analytics in the financial domain. He has worked on training since his student days, creating training programs and summer-school initiatives on web frameworks, Python, and Big Data, and holding workshops on "Getting started with Open Source" for beginners.

He specializes in creating solutions in the Big Data, Machine Learning, and Analytics world. In the open source world he has been involved in the coala, Jupyter, and Wikimedia communities, and has worked on many other projects related to xgboost, conda, pandas, sqlalchemy, etc. His love for Python started nearly 9 years ago (yes, when the first blog post about moving away from Python 2.7 was written) with robotics and game development.

Speaker Links:

Profiles:

GSoC 2015 - GNOME + coala

GSoC 2016 - Wikimedia + PyWikibot

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: