Let's Learn Statistics!
Bargava Subramanian (~bargava)
Statistics supplies many of the concepts and thought processes that drive data science. But is statistics an arcane mathematical subject, filled with esoteric formulae and concepts, and hence difficult to learn? We think not.
"I am a programmer", "Math is not my cup of tea", "It's been ages since I did math. I don't know if I am capable of doing it", "WTH? I thought everything is commoditized/productized. So, why learn statistics?" We hear ya!
Why don't we take an application-centric programming approach to learn some of the basic concepts that drive data science? Is it possible? Most definitely.
Heavily inspired by Allen Downey's books Think Stats and Think Bayes, and by his PyCon US workshops, we try to demystify some of those concepts using real-life examples. Some key concepts that we plan to cover are:
- Standard deviation, variance, covariance (Assumption: hoping everyone knows a bit about mean, median, mode :) )
- Probability distribution
- What is hypothesis testing?
- What are the t-test, p-value, chi-squared test, and confidence intervals?
- Confidence level and Significance level
- Re-sampling and its relevance in the world of Big Data
- What is A/B testing?
- A simple linear regression model
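To give a flavour of the programming-first approach, the resampling and A/B testing items above can be illustrated with a permutation test. This is a minimal sketch and is not taken from the workshop repository: the two groups are made-up numbers, and `permutation_p_value` is a hypothetical helper name. It shuffles the pooled data many times and counts how often a random split produces a difference in means at least as large as the observed one.

```python
from __future__ import division  # keeps the code usable on Python 2.7 as well

import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_iter=10000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter


# Made-up data for illustration only (e.g. a metric measured under variants A and B).
variant_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
variant_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6, 13.3]

p_value = permutation_p_value(variant_a, variant_b)
print("estimated p-value:", p_value)
```

A small estimated p-value suggests that a difference this large would rarely arise if the A/B labels made no difference. No distributional formula is needed, which is the appeal of resampling for programmers.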
We will do the data analysis using Pandas along with NumPy and SciPy, and some plotting using matplotlib/seaborn.
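As a rough idea of what this looks like in practice, here is a minimal sketch using NumPy only. The arrays are made-up values rather than workshop data, and the variable names are illustrative. It computes the sample variance, standard deviation and covariance, then fits a simple linear regression by least squares.

```python
import numpy as np

# Made-up measurements for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample variance and standard deviation (ddof=1 divides by n - 1).
var_x = np.var(x, ddof=1)
std_x = np.std(x, ddof=1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
cov_xy = np.cov(x, y)[0, 1]

# Simple linear regression y = slope * x + intercept, fitted by least squares.
slope, intercept = np.polyfit(x, y, 1)

print("variance of x:", var_x)
print("standard deviation of x:", std_x)
print("covariance of x and y:", cov_xy)
print("fitted line: y = %.3f * x + %.3f" % (slope, intercept))
```

For a simple linear regression the fitted slope equals the covariance of x and y divided by the variance of x, so the last two results tie the descriptive statistics and the regression model together.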
We will use IPython Notebook to drive the workshop. The contents of the workshop are available at the repo: https://github.com/rouseguy/intro2stats (currently a work in progress). All the code, data and presentations will be available in this repository prior to the workshop.
- Basics of Python (Must): Attendees should know how to write functions; read and parse text files (csv, txt, fwf); use conditional and looping constructs; use standard libraries such as os and sys; and work with lists, list comprehensions and dictionaries.
- Introduction to Pandas, Numpy, Scipy (Good to have).
Links to get started on all of them are given below in the Content urls section.
Software Requirements (Must have)
- Python 2.7
We will clone a git repo and work from it; the link will be posted closer to the workshop date. The repo will include a requirements file for installing all the necessary libraries. For the sake of completeness, we will need the latest versions of the following libraries:
- IPython (along with IPython notebook)
If attendees are comfortable, they can install and use Anaconda. If using Anaconda, please verify before the workshop starts that all the requisite libraries are installed. Disclosure: I use Anaconda.
Content URLs
- Workshop Repo - Introduction to Statistics
- Introduction to Pandas
- Introduction to Numpy and Scipy
- Introduction to Python
- Introduction to Statistics by Allen Downey - Book
- Introduction to Statistics by Allen Downey - PyCon 2015, Montreal - Video
- Introduction to Bayesian Statistics by Allen Downey - Book
- Bargava Subramanian is a Senior Statistician at Cisco. He has a Master's degree from the University of Maryland, College Park, USA.
- Raghotham is a full-stack developer at RedMart. He has a Master's degree from BITS, Pilani.
- Introduction to Classification Methods in Machine Learning, Fifth Elephant 2014, Bangalore
- Data processing using Blaze, BangPypers Jan 2015, Bangalore
- Visualization Libraries in Python, BangPypers Apr 2015, Bangalore