Let's Learn Statistics!

Bargava Subramanian (~bargava)



Description:

Statistics supplies important concepts and thought processes that drive Data Science. But is Statistics an arcane mathematical subject, filled with esoteric formulae and concepts, and hence difficult to learn? We think not.

BUT?!!

"I am a programmer", "Math is not my cup of tea", "It's been ages since I did math. I don't know if I am capable of doing it", "WTH? I thought everything is commoditized/productized. So, why learn statistics?" We hear ya!

Why don't we take an application-centric programming approach to learn some of the basic concepts that drive data science? Is it possible? Most definitely.

Heavily inspired by Allen Downey's books Think Stats and Think Bayes, and also his PyCon US workshop(s), we try to demystify some of those concepts using real-life examples. The key concepts we plan to cover are:

  • Standard Deviation, Variance, Covariance (Assumption: we hope everyone knows a bit about mean, median and mode :) )
  • Probability distribution
  • What is hypothesis testing?
  • What are the t-test, p-value, chi-squared test and confidence intervals?
  • Correlation
  • Confidence level and Significance level
  • Re-sampling and its relevance in the world of Big Data
  • What is A/B testing?
  • A simple linear regression model
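To give a flavour of the programming-first approach, here is a minimal sketch of a few of the concepts above: variance, standard deviation, covariance, correlation, and a re-sampling (permutation) test of the kind used in A/B testing. It deliberately uses only the Python standard library so that it runs anywhere; the workshop itself uses Pandas, NumPy and SciPy. The function names and the data are made up for illustration and are not taken from the workshop repository.

```python
import math
import random


def mean(xs):
    return sum(xs) / float(len(xs))


def variance(xs):
    """Sample variance: average squared deviation, dividing by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def std_dev(xs):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(xs))


def covariance(xs, ys):
    """Sample covariance: how two variables move together."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided p-value for a difference in means, by shuffling labels.

    If the group labels did not matter, shuffling them should produce
    differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / float(n_iter)


# Made-up data, purely for illustration.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
print(variance(hours))
print(correlation(hours, scores))

# Made-up A/B groups with clearly different means.
group_a = [10, 11, 12, 13, 14]
group_b = [20, 21, 22, 23, 24]
print(permutation_p_value(group_a, group_b))
```

A small p-value from the permutation test says that a difference this large would rarely appear if the two groups were interchangeable.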

We will do the data analysis using Pandas along with NumPy and SciPy, and the plotting using matplotlib/seaborn.

We will use IPython Notebook to drive the workshop. The contents of the workshop are available at the repo: https://github.com/rouseguy/intro2stats . It is currently a work in progress. All the code, data and presentations will be available in this repository prior to the workshop.

Prerequisites:

Technical/Software Knowledge

  • Basics of Python (Must): Attendees should know how to write functions; read and parse text files (csv, txt, fwf); use conditional and looping constructs; use standard libraries like os and sys; and work with lists, list comprehensions and dictionaries.
  • Introduction to Pandas, NumPy, SciPy (Good to have).

Links to get started on all of them are given below in the Content urls section.
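As a rough self-check of the Python basics listed above, the following sketch combines functions, CSV parsing, loops, list comprehensions and dictionaries. If it reads comfortably, the "Must" prerequisite is probably met. The helper names and the sample rows are hypothetical and exist only for this illustration.

```python
import csv


def parse_rows(lines):
    """Parse CSV lines (header first) into a list of dictionaries."""
    reader = csv.reader(lines)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


def mean_by_group(rows, group_key, value_key):
    """Average the numeric column value_key within each group_key value."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[group_key], []).append(float(row[value_key]))
    return {group: sum(vals) / len(vals) for group, vals in grouped.items()}


# Made-up sample data; in practice these lines would come from a file.
sample = ["city,temp", "Pune,30", "Delhi,35", "Pune,32"]
rows = parse_rows(sample)
print(mean_by_group(rows, "city", "temp"))
```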

Software Requirements-Must have

  • Python 2.7
  • git

Software Requirements-Recommended

We will clone a git repo and work off it. The link will be posted closer to the workshop date. There will be a requirements file that can be used to install all the necessary libraries. For the sake of completeness, we will need the latest versions of the following libraries:

  • NumPy
  • Pandas
  • SciPy
  • Matplotlib
  • Seaborn
  • IPython (along with IPython notebook)

Software-Optional

If attendees are comfortable, they can install and use Anaconda. If using Anaconda, please verify before the start of the workshop that all the requisite libraries are installed. Disclosure: I use Anaconda.
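One possible way to verify the installation, whether or not Anaconda is used, is a short script that tries to import each library and reports its version. This is only a sketch: the function name check_libraries is made up here, and the list of import names is assumed from the library list above.

```python
import importlib

# Import names assumed from the libraries listed for the workshop.
REQUIRED = ["numpy", "pandas", "scipy", "matplotlib", "seaborn", "IPython"]


def check_libraries(names=REQUIRED):
    """Map each module name to its version string, or None if it fails to import."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            status[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # A missing library raises ImportError; a broken install can
            # raise other errors, so both are reported as unavailable.
            status[name] = None
    return status


for name, version in sorted(check_libraries().items()):
    print("%-12s %s" % (name, version if version else "MISSING"))
```

Any line reporting MISSING points to a library that still needs to be installed before the workshop.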

Speaker Info:

  • Bargava Subramanian is a Senior Statistician at Cisco. He has a Masters from University of Maryland, College Park, USA.
  • Raghotham is a full-stack developer at RedMart. He has a Masters from BITS, Pilani.

Section: Data Visualization and Analytics
Type: Workshops
Target Audience: Beginner