Beyond one-hot encoding: Boosting your model performance

Rahul Bhatia (~rbhatia46)




  • For a Data Scientist, one of the most fundamental tasks before modelling is data preparation, and that covers data of every type, including categorical data. Most machine learning models expect numeric inputs rather than text, yet categorical fields are everywhere in real-world datasets. The technique used to encode these categories can have a significant impact on model performance, and that impact also depends on which machine learning algorithm is paired with which encoding (for example, a decision tree tends to be a poor choice for a dataset with many high-cardinality categorical variables once they are one-hot encoded, because the resulting columns are sparse and each one carries very little information for a split). The aim of this talk is to demonstrate methods for encoding categorical data and to offer guidance on when to use each one, bringing us one step closer to better feature engineering. It is inspired by my experience working on industry-level Data Science problems across various domains, where there is a lot of messy data, and in particular a lot of messy categorical data.

  • After attending the talk, participants will know encoding techniques beyond one-hot encoding, which is not always the best choice for categorical data, and will understand the pros and cons of each technique covered. The talk covers the following techniques, with a code walkthrough of their implementation in Python:

  • Ordinal/Label Encoding
  • One-hot encoding
  • Count encoding
  • Target/Response Encoding
  • Smoothed Target Encoding
  • Probability Ratio Encoding
  • Weight of Evidence
  • Rare Label Encoding
  • Feature Hashing
  • Embedding
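
To give a flavour of the list above, here is a minimal sketch of three of the techniques (count encoding, smoothed target encoding and rare label encoding) in plain Python. It is an illustration only, not the code from the talk: the function names, the toy `cities`/`churned` data, the smoothing weight `m` and the `min_count` threshold are all assumptions made for this example. The smoothing formula used is one common variant, `(n * category_mean + m * global_mean) / (n + m)`.

```python
from collections import Counter, defaultdict


def count_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def smoothed_target_encode(values, targets, m=2.0):
    """Replace each category with its mean target, shrunk towards the global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    where n is the category's row count and m (an assumed hyperparameter)
    controls how strongly small categories are pulled to the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    # n * category_mean is simply the per-category sum of targets.
    mapping = {v: (sums[v] + m * global_mean) / (counts[v] + m) for v in counts}
    return [mapping[v] for v in values]


def rare_label_encode(values, min_count=2, rare_label="Rare"):
    """Group categories seen fewer than min_count times under one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else rare_label for v in values]


# Hypothetical toy data: a categorical feature and a binary target.
cities = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"]
churned = [1, 0, 1, 0, 1, 0]

count_encoded = count_encode(cities)
target_encoded = smoothed_target_encode(cities, churned, m=2.0)
rare_encoded = rare_label_encode(cities, min_count=2)

print(count_encoded)
print([round(x, 3) for x in target_encoded])
print(rare_encoded)
```

In a real pipeline the target-encoding mapping should be learned on the training split only and then applied to validation and test data, since computing it on the full dataset leaks the target into the features.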

Speaker Info:

Rahul Bhatia is a Machine Learning Engineer at Rakuten and has worked in Data Science and AI for a long time. He is an active open source contributor and was a Google Summer of Code mentor for Public Lab in 2019 and, before that, a Google Code-in mentor in 2018. He previously spoke at PyCon MY 2019 and has long been active in the community. He aims to help organisations generate business value from their data by developing data-driven solutions at scale.

Speaker Links:


Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: