A Comprehensive Overview of Dealing with Imbalanced Datasets in Python

iNeil77


Description:

Classification algorithms often underperform on data that is heavily skewed towards one class, since most of them are designed under the assumption of a roughly uniform class distribution. A second, related assumption is that every misclassification costs the same. In a transaction fraud detection setting, for instance, fraudulent transactions are vastly outnumbered by genuine ones, and wrongly classifying a fraudulent transaction as genuine costs far more than the inconvenience of flagging a benign transaction as malicious.
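
The unequal-cost argument can be made concrete with a small sketch. It assumes a classifier that outputs a fraud probability p, that correct decisions cost nothing, and purely hypothetical costs of 1 for a false alarm and 50 for a missed fraud. Under those assumptions the expected cost is lowest when a transaction is flagged whenever p exceeds cost_fp / (cost_fp + cost_fn). The names and numbers are illustrative and not taken from the talk.

```python
# Hypothetical costs, chosen only for illustration.
COST_FALSE_POSITIVE = 1.0   # genuine transaction flagged as fraud
COST_FALSE_NEGATIVE = 50.0  # fraudulent transaction let through


def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which flagging has the lower expected cost.

    Flagging costs (1 - p) * cost_fp in expectation and not flagging costs
    p * cost_fn, so flagging wins when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def should_flag(fraud_probability,
                cost_fp=COST_FALSE_POSITIVE,
                cost_fn=COST_FALSE_NEGATIVE):
    """Return True when flagging the transaction is the cheaper decision."""
    return fraud_probability > cost_optimal_threshold(cost_fp, cost_fn)


threshold = cost_optimal_threshold(COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)
print(f"flag when p > {threshold:.4f}")
print(should_flag(0.05))  # a 5% fraud probability is already worth flagging
print(should_flag(0.01))
```

With equal costs the threshold falls back to the familiar 0.5, which is why a default decision rule tends to let expensive errors through when the costs are lopsided.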

This talk aims to cover the various approaches used to cope with this commonly faced problem:

  1. Oversampling Methods
  2. Undersampling Methods
  3. Synthetic Data Generation
  4. Cost Sensitive Learning
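
As a rough illustration of the first three approaches, here is a minimal standard-library sketch on toy data. It is not the talk's implementation: the function names are hypothetical, and the synthetic-data step is a simplified SMOTE-style interpolation that uses only the single nearest minority neighbour. The imbalanced-learn library listed under References provides maintained versions of these ideas (for example RandomOverSampler, RandomUnderSampler and SMOTE).

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(pool, target))
        y_out.extend([label] * target)
    return X_out, y_out


def interpolate_minority(X_min, n_new, seed=0):
    """Simplified SMOTE-style generation; needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(X_min)
        # Nearest other minority sample by squared Euclidean distance.
        b = min((p for p in X_min if p is not a),
                key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)))
        t = rng.random()  # position along the segment from a to b
        synthetic.append([u + t * (v - u) for u, v in zip(a, b)])
    return synthetic


# Toy data: 95 "genuine" samples (label 0) and 5 "fraudulent" ones (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
X_minority = [x for x, lab in zip(X, y) if lab == 1]
X_synth = interpolate_minority(X_minority, n_new=90)

print(Counter(y_over), Counter(y_under), len(X_synth))
```

Oversampling keeps every original sample but repeats minority points, undersampling discards majority data, and interpolation creates new points that lie between existing minority samples.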

Key takeaways from this talk:

  1. How imbalanced datasets undermine classifier performance
  2. How to eliminate class imbalance
  3. The advantages and disadvantages of oversampling, undersampling and synthetic data generation
  4. Robust evaluation metrics insensitive to class imbalance
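
To see why plain accuracy can mislead on skewed data, consider a hedged sketch with a hypothetical "classifier" that always predicts the majority class on toy labels. The helper names are illustrative. scikit-learn offers maintained equivalents such as balanced_accuracy_score and recall_score.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of actual negatives that were left alone."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of recall and specificity, so each class counts equally."""
    return (recall(tp, fn) + specificity(tn, fp)) / 2


# Toy labels: 95 genuine (0), 5 fraudulent (1); the model predicts 0 every time.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
bal = balanced_accuracy(tp, fp, tn, fn)
print(f"accuracy={acc:.2f} recall={rec:.2f} balanced_accuracy={bal:.2f}")
```

The model looks strong on accuracy while catching no fraud at all. Recall and balanced accuracy expose this because they score each class separately.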

Prerequisites:

  • Basic Python
  • Understanding of basic performance evaluation metrics

Content URLs:

Slides

Deck: https://slides.com/ineil77/deck/fullscreen

References

Imbalanced Learn Python Library: http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html

Speaker Info:

I'm Indraneil Paul, a final-year Computer Science student at IIIT Hyderabad. Through my research work, I have been involved in machine learning, computer vision and mathematical optimisation for most of the past three years. I previously worked in the Computer Vision lab on an autonomous driving project and am currently applying graph-based machine learning models to social networks. I was also a Google Summer of Code '17 student with the electric vehicle startup Green Navigation (now nav-e).

Speaker Links:

GitHub: https://github.com/iNeil77

Section: Data science
Type: Talks
Target Audience: Beginner