Beyond one-hot encoding: Boosting your model performance

Rahul Bhatia (~rbhatia46)



Description:

  • For a data scientist, one of the most fundamental tasks before modelling is data cleaning, which covers data of all types, including categorical data. Most machine learning models work on numbers rather than words, yet categorical fields are everywhere in real-world datasets. The technique used to encode these categories can have a significant impact on model performance, and that impact also depends on the machine learning algorithm paired with a particular encoding (for example, a decision tree tends to be a poor choice for a dataset with many categorical variables once they are one-hot encoded). The aim of this talk is to demonstrate methods for encoding categorical data and to share ideas on when to use each one, which brings us one step closer to better feature engineering. The talk is heavily inspired by my experience working on industry data science problems across various domains, where there is a lot of messy data, and particularly a lot of messy categorical data.

  • After attending the talk, participants will understand encoding techniques beyond simply one-hot encoding categorical data, which is not always the best choice, and will learn the pros and cons of each technique covered. The following techniques are covered, along with a code walkthrough of their implementation in Python:

    • Ordinal/Label Encoding
    • One-hot encoding
    • Count encoding
    • Target/Response Encoding
    • Smoothed Target Encoding
    • Probability Ratio Encoding
    • Weight of Evidence
    • Rare Label Encoding
    • Feature Hashing
    • Embedding
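
To give a flavour of the techniques listed above, here is a minimal pandas sketch of a few of them. It is an illustration only, not the code from the talk: the toy data, the column names, the smoothing weight `m`, and the 20% rarity threshold are all assumptions chosen for the example. Weight of Evidence, Probability Ratio Encoding, Feature Hashing, and Embedding are left out to keep it short.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Ordinal/label encoding: map each category to an integer code.
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace each category with how often it occurs.
df["city_count"] = df["city"].map(df["city"].value_counts())

# Target encoding: replace each category with the mean target in that category.
stats = df.groupby("city")["target"].agg(["mean", "count"])
df["city_target"] = df["city"].map(stats["mean"])

# Smoothed target encoding: shrink each category mean toward the global mean.
# m is an assumed smoothing weight; larger m trusts small categories less.
m = 2.0
prior = df["target"].mean()
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
df["city_target_smooth"] = df["city"].map(smoothed)

# Rare label encoding: group categories below an assumed 20% frequency
# threshold into a single "Rare" label.
freq = df["city"].value_counts(normalize=True)
rare_labels = freq[freq < 0.2].index
df["city_rare"] = df["city"].where(~df["city"].isin(rare_labels), "Rare")

print(df)
print(one_hot)
```

In this sketch the target-based encodings are computed on the full toy dataset for brevity. On real data they would normally be fitted on the training split (or out-of-fold) only, so that the target does not leak into the features.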

Speaker Links:

  • https://rbhatia46.github.io
  • https://www.linkedin.com/in/rahul-bhatia-67ba08121/
  • https://medium.com/@rbhatia46

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate