Inferential Statistics with Python
Rounak Banik (~rounakbanik)
Inferential Statistics is the art of drawing conclusions and predicting outcomes from data. It is an incredibly important component of exploratory data analysis and A/B testing.
In this talk, we will give you a brief overview of the major theories underlying inferential statistics, its many tools and techniques, and their implementation in Python. Over the course of the talk, we will also walk you through several real-world datasets and give you a taste of how to gain insights from your data through hypothesis testing and data visualisation.
Our talk has the following contents:
- What is Statistics? The Difference between Descriptive and Inferential Statistics
- A brief primer on Descriptive Statistics: Central Tendencies, Binomial and Normal Distributions, Z-Scores.
- The importance of Sampling. Various kinds of sampling bias. Quality and quantity of sampled data.
- Estimation of a population proportion and mean. Sampling error, confidence intervals. Central Limit Theorem.
- Basics of Hypothesis Testing
- One-Sample and Two-Sample Significance Tests, Chi-Square Significance Test
- Correlation, Scatter plots and Linear Regression
- Which Statistical Test to use on what kind of data
- Statistical and Practical Significance of test results
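As a rough illustration of the estimation topics listed above, here is a minimal sketch of normal-approximation confidence intervals for a population mean and a population proportion. It is not taken from the talk's notebooks: it uses only the Python standard library, the function names are hypothetical, and the data and counts are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation (CLT-based) confidence interval for a population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(n)       # z times the standard error
    centre = mean(sample)
    return centre - margin, centre + margin


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


random.seed(0)
# Synthetic, deliberately non-normal data: the CLT is what justifies
# a normal-based interval for the mean when the sample is large.
sample = [random.expovariate(0.5) for _ in range(400)]

mean_low, mean_high = mean_confidence_interval(sample)
# Made-up counts: 12 "successes" observed in a sample of 400.
prop_low, prop_high = proportion_confidence_interval(successes=12, n=400)

print(f"95% CI for the mean: ({mean_low:.3f}, {mean_high:.3f})")
print(f"95% CI for the proportion: ({prop_low:.4f}, {prop_high:.4f})")
```

The normal approximation is a reasonable default only for large samples; for small samples, or proportions very close to 0 or 1, other intervals are usually preferred.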
To demonstrate the above concepts, we will be implementing the methods in Python and working through synthetic data as well as real world datasets.
- Exploring Literacy Rates in Punjab and Delhi: From data retrieved from Kaggle, we will try to determine if there is a significant difference in literacy rates of Punjab and Delhi.
- NBA Player Heights: From a sample of NBA Players, we will try to find out if the mean height is actually 6'7" as reported by most publications.
- Suicide Rates in India: From the suicide statistics for 2001-11, we will try to determine if men are as likely as women to die by suicide.
- Airbnb Country Preferences of Men and Women: We will use Airbnb's data to determine if there is a relationship between sex and the countries in which users prefer to book Airbnbs.
- Olympian Weights: We will try to estimate the average weight of Olympians given a small sample.
- Credit Card Fraud: We will try and estimate the fraction of fraudulent transactions given a small subset of the data.
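To give a flavour of the kind of test these case studies call for, here is a minimal sketch of a one-sample z-test for a mean (as in the NBA heights question, where 6'7" is 79 inches) and a two-proportion z-test (as in comparing rates between two groups). This is an assumption-laden illustration rather than the talk's actual code: the function names are hypothetical, all numbers are synthetic, and only the standard library is used.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_sample_z_test(sample, mu0):
    """Two-sided large-sample z-test of H0: population mean == mu0."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: the two population proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


random.seed(1)
# Synthetic "heights" in inches; these are NOT real NBA data.
heights = [random.gauss(78.0, 3.5) for _ in range(100)]
z_mean, p_mean = one_sample_z_test(heights, mu0=79.0)

# Made-up counts for two groups of 1000 people each.
z_prop, p_prop = two_proportion_z_test(x1=760, n1=1000, x2=860, n2=1000)

print(f"one-sample test: z = {z_mean:.2f}, p = {p_mean:.4f}")
print(f"two-proportion test: z = {z_prop:.2f}, p = {p_prop:.2e}")
```

With small samples, a t-test is the more appropriate choice for the mean; the z-test is used here only because it needs nothing beyond the standard library.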
There are no prerequisites besides a basic understanding of high-school-level probability and statistics. The following are desirable but not required:
- Basics of Pandas and Numpy.
- An undergraduate course in Descriptive Statistics (Different kinds of distributions, their means and variances, etc.)
To follow along, it is highly recommended that the audience have Jupyter installed, as we will be walking through the code in Jupyter Notebooks. Also, make sure you have the required libraries installed (any sufficiently recent version will do).
For those who are not data scientists, it is highly recommended that you install the Anaconda distribution for your OS, as it comes neatly packaged with the notebook software and the requisite Python libraries.
Note: The content is currently in development. Suggestions and improvements are welcome.
- Code: https://github.com/rounakbanik/inferential_stats_pycon
- Slides: https://www.slideshare.net/rounakbanik/inferential-stats
- Transcript: https://docs.google.com/document/d/1jrl2Cvh42ByAKPVN6K8NKUeD1Al8sdhanmUr1IBpxzE/edit?usp=sharing
Rounak Banik is a final-year undergraduate at IIT Roorkee. Although he is currently pursuing Electronics and Communication Engineering, his professional interests lie in Web Development and Data Science. He has previously interned as a Software Engineer at Parceed, a New York-based startup, and at Springboard, a Data Science EdTech startup based in San Francisco and Bangalore. He also worked as a Backend Development Instructor with Acadview, teaching Python and Django to around 35 college students from Delhi and Dehradun. He is currently working directly under the Director of IIT Roorkee and Dr. Durga Toshniwal on his B.Tech Project on Fake News and Review Detection. He is also a student of Springboard's Data Science Career Track, mentored directly by Baran Toppare, former Lead Data Scientist at Getir.
Apoorva Agarwal is a third-year undergraduate pursuing Chemical Engineering at IIT Roorkee. She has previously interned as a Machine Learning Engineer at Indifi, a FinTech startup based in Gurgaon. She attended a Deep Learning Summer School at IIT Kharagpur and took part in the 7-day Deep Learning for Visual Computing Conference at IIT KGP. She is currently the Editor-in-Chief of Geek Gazette, the technical magazine of IIT Roorkee, and a member of the Data Science Group (DSG) at IIT Roorkee.
- LinkedIn: https://www.linkedin.com/in/rounakbanik/
- GitHub: https://github.com/rounakbanik
- LinkedIn: https://www.linkedin.com/in/apoorva-agarwal-455a69145/