A Comprehensive Overview of Dealing with Imbalanced Datasets in Python
Classification algorithms are known to underperform when faced with data heavily skewed towards one class, as most of them are designed under the assumption of a uniform class distribution. A related caveat is the assumption that all misclassifications carry equal cost. For instance, in transaction fraud detection, fraudulent transactions are vastly outnumbered by genuine ones, and the cost of wrongly classifying a fraudulent transaction as genuine far outstrips the inconvenience of flagging a benign transaction as malicious.
This talk aims to cover the various approaches used to cope with this commonly faced problem:
- Oversampling Methods
- Undersampling Methods
- Synthetic Data Generation
- Cost Sensitive Learning
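To make the first two approaches concrete, here is a minimal NumPy sketch of random over- and undersampling on a toy dataset; the function names and the 90/10 class split are illustrative, not part of any library API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 majority-class (0) vs 10 minority-class (1) samples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority samples until every class matches the largest."""
    n_max = max(np.sum(y == c) for c in np.unique(y))
    parts = []
    for c in np.unique(y):
        c_idx = np.flatnonzero(y == c)
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        parts.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Discard randomly chosen majority samples until every class matches the smallest."""
    n_min = min(np.sum(y == c) for c in np.unique(y))
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

X_over, y_over = random_oversample(X, y, rng)    # 90 + 90 = 180 samples
X_under, y_under = random_undersample(X, y, rng) # 10 + 10 = 20 samples
```

Cost-sensitive learning, by contrast, leaves the data untouched and instead reweights errors; in scikit-learn many estimators expose this via the `class_weight='balanced'` parameter.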
Key takeaways from this talk:
- How imbalanced data sets undermine classifier performance
- How to mitigate class imbalance
- The advantages and disadvantages of over/under sampling and synthetic data generation
- Robust evaluation metrics insensitive to class imbalance
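The last takeaway can be previewed with a small sketch: plain accuracy rewards a classifier that only ever predicts the majority class, while a per-class metric such as balanced accuracy (shown here in a hand-rolled form for illustration) does not:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; a majority-only predictor scores 0.5, not 0.9."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

y_true = np.array([0] * 90 + [1] * 10)
y_majority = np.zeros(100, dtype=int)  # always predicts the majority class

print(np.mean(y_majority == y_true))          # plain accuracy: 0.9, looks great
print(balanced_accuracy(y_true, y_majority))  # 0.5, exposes the failure
```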
Prerequisites:
- Basic Python
- Understanding of basic performance evaluation metrics
Imbalanced Learn Python Library: http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
I'm Indraneil Paul, a final-year Computer Science student at IIIT Hyderabad. I have been involved in machine learning, computer vision and mathematical optimisation for the best part of the past three years through my research work. I was previously working in the Computer Vision lab on an autonomous driving project and am currently working on applying graph-based machine learning models to social networks. I was also a Google Summer of Code '17 student under electric vehicle startup Green Navigation (now nav-e).