Let's Learn Statistics!

Bargava Subramanian (~bargava) | 03 May, 2015

Description:

Statistics supplies many of the concepts and thought processes that drive data science. But is it an arcane mathematical subject filled with esoteric formulae and concepts, and hence difficult to learn? We think not.

BUT?!!

"I am a programmer", "Math is not my cup of tea", "It's been ages since I did math. I don't know if I am capable of doing it", "WTH? I thought everything was commoditized/productized. So why learn statistics?" We hear ya!

Why don't we take an application-centric programming approach to learn some of the basic concepts that drive data science? Is it possible? Most definitely.

Heavily inspired by Allen Downey's books Think Stats and Think Bayes, and by his PyCon US workshops, we try to demystify some of those concepts using real-life examples. The key concepts we plan to cover are:

Standard Deviation, Variance, Covariance (assuming everyone knows a bit about mean, median and mode :) )
Probability distribution
What is hypothesis testing?
What are the t-test, p-value, chi-squared test and confidence intervals?
Correlation
Confidence level and Significance level
Re-sampling and its relevance in the world of Big Data
What is A/B testing?
A simple linear regression model
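
As a taste of the re-sampling and A/B testing topics listed above, here is a minimal sketch of a permutation test. It is an illustration only, not taken from the workshop material: it uses just the Python standard library rather than the Pandas/Scipy stack, the function names `mean` and `permutation_test` are made up for this example, and the conversion data is hypothetical toy data.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / float(len(values))


def permutation_test(group_a, group_b, n_iter=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value). The p-value is the share of
    random relabelings whose difference is at least as extreme as the
    observed one.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled data and split it again.
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / float(n_iter)


# Hypothetical toy data: 1 = visitor converted, 0 = visitor did not.
variant_a = [1] * 20 + [0] * 180  # 20 of 200 converted
variant_b = [1] * 35 + [0] * 165  # 35 of 200 converted

observed_diff, p_value = permutation_test(variant_a, variant_b)
print("observed difference in conversion rate: %.3f" % observed_diff)
print("permutation p-value: %.4f" % p_value)
```

A small p-value suggests the difference between the two variants is unlikely to come from chance relabeling alone. No distributional formula is needed, which is part of the appeal of re-sampling methods.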

We will do the data analysis with Pandas, along with Numpy and Scipy, and the plotting with matplotlib/seaborn.

We will use IPython Notebook to drive the workshop. The contents of the workshop are available at the repo https://github.com/rouseguy/intro2stats, which is currently a work in progress. All the code, data and presentations will be available in this repository before the workshop.

Prerequisites:

Technical/Software Knowledge

Basics of Python (Must): Attendees should know how to write functions; read and parse text files (csv, txt, fwf); use conditional and looping constructs; use standard libraries like os and sys; and work with lists, list comprehensions and dictionaries.
Introduction to Pandas, Numpy, Scipy (Good to have).

Links to get started on all of them are given below in the Content URLs section.

Software Requirements-Must have

Python 2.7
git

Software Requirements-Recommended

We will clone a git repo and work off it. The link will be posted closer to the workshop date. There will be a requirements file that, when executed, will install all the necessary libraries. For the sake of completeness, we will need the latest versions of the following libraries:

Numpy
Pandas
Scipy
Matplotlib
Seaborn
IPython (along with IPython notebook)

Software-Optional

Attendees who are comfortable with it can install and use Anaconda. If using Anaconda, please verify before the workshop starts that all the requisite libraries are installed. Disclosure: I use Anaconda.
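
One possible way to check an installation, with or without Anaconda, is a short script like the sketch below. It is not part of the workshop repo; the helper name `check_libraries` is hypothetical, and the module names are the usual import names of the libraries listed above.

```python
import importlib

# Usual import names of the libraries listed in the requirements.
REQUIRED = ["numpy", "pandas", "scipy", "matplotlib", "seaborn", "IPython"]


def check_libraries(names):
    """Map each module name to its version string.

    The value is "unknown" if the module exposes no __version__,
    and None if the module cannot be imported.
    """
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Any failure at import time (missing or broken install)
            # is reported as missing.
            status[name] = None
        else:
            status[name] = getattr(module, "__version__", "unknown")
    return status


for name, version in sorted(check_libraries(REQUIRED).items()):
    print("%-12s %s" % (name, version if version else "MISSING"))
```

Any line that prints MISSING points to a library that still needs to be installed before the workshop.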

Content URLs:

Workshop Repo- Introduction to Statistics
Introduction to Pandas
Introduction to Numpy and Scipy
Introduction to Python
Introduction to Statistics by Allen Downey - Book
Introduction to Statistics by Allen Downey - PyCon 2015, Montreal - Video
Introduction to Bayesian Statistics by Allen Downey - Book

Speaker Info:

Bargava Subramanian is a Senior Statistician at Cisco. He has a Master's degree from the University of Maryland, College Park, USA.
Raghotham is a full-stack developer at RedMart. He has a Master's degree from BITS, Pilani.

Speaker Links:

Introduction to Classification Methods in Machine Learning, Fifth Elephant 2014, Bangalore
Data processing using Blaze, BangPypers Jan 2015, Bangalore
Visualization Libraries in Python, BangPypers Apr 2015, Bangalore

Section:	Data Visualization and Analytics
Type:	Workshops
Target Audience:	Beginner
Last Updated:	13 Aug, 2015
