January 4, 2023

Categorical Data

The article published in Towards Data Science by Andrew Engel clearly explains categorical variables for machine learning algorithms. I advise you to visit and read through this article.

While most machine learning algorithms only work with numeric values, many important real-world features are not numeric but categorical. Categorical features take on a fixed set of levels or values, such as state or customer type. They can also be created by binning underlying numeric features, for example grouping individuals into age ranges (e.g., 0–10, 11–18, 19–30, 31–50, etc.). Finally, they can be numeric identifiers where the relationship between the values is not meaningful; ZIP codes are a common example. Two ZIP codes that are close numerically may be geographically farther apart than another ZIP code that is numerically distant.
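
For instance, binning with pandas might look like the following minimal sketch (the ages are made up for illustration; pd.cut labels each value with the range it falls into):

    import pandas as pd

    # Made-up ages for illustration.
    ages = pd.Series([4, 15, 22, 37, 45], name="age")

    # pd.cut bins the numeric column into an ordered categorical feature.
    # Bin edges follow the ranges above: (0, 10], (10, 18], (18, 30], (30, 50].
    age_group = pd.cut(ages, bins=[0, 10, 18, 30, 50],
                       labels=["0-10", "11-18", "19-30", "31-50"])
    print(age_group)  # -> 0-10, 11-18, 19-30, 31-50, 31-50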

Since most machine learning algorithms cannot use categorical features directly, these features need to be transformed into numerical ones. While numerous techniques exist for this transformation, the most common is one-hot encoding.

In one-hot encoding, a categorical variable is converted into a set of binary indicator variables, one per category in the dataset. For a variable with the levels clear, partly cloudy, rain, wind, snow, cloudy, and fog, seven new variables will be created, each containing either 1 or 0. For each observation, the indicator matching that observation's category is set to 1 and all the others are set to 0.
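
As a quick illustration, here is a minimal sketch using pandas (the observations are made up; declaring the column as Categorical with all seven levels guarantees one indicator per level, even for levels absent from the sample):

    import pandas as pd

    # Made-up weather observations.
    df = pd.DataFrame({"weather": ["rain", "clear", "fog", "rain"]})

    # List all seven levels so every one gets an indicator column,
    # even if it never appears in this small sample.
    levels = ["clear", "partly cloudy", "rain", "wind", "snow", "cloudy", "fog"]
    df["weather"] = pd.Categorical(df["weather"], categories=levels)

    # One binary column per level; the matching level is 1, all others 0.
    dummies = pd.get_dummies(df["weather"], dtype=int)
    print(dummies)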

Let us watch the video tutorial from the codebasics channel about categorical variables, dummy variables, and one-hot encoding.

The code from the tutorial is on GitHub: https://github.com/codebasics/py/blob/master/ML/5_one_hot_encoding/one_hot_encoding.ipynb
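
In the same spirit as the tutorial (the town/area/price values below are hypothetical, not the notebook's exact data), dummy variables can feed directly into a regression:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical home-price data.
    df = pd.DataFrame({
        "town": ["monroe", "monroe", "west windsor", "robinsville"],
        "area": [2600, 3000, 2800, 3200],
        "price": [550000, 565000, 590000, 620000],
    })

    # drop_first avoids the "dummy variable trap": with all indicators
    # present, the columns are perfectly collinear.
    X = pd.get_dummies(df[["town", "area"]], columns=["town"],
                       drop_first=True, dtype=int)
    y = df["price"]

    model = LinearRegression().fit(X, y)
    print(model.predict(X.iloc[:1]))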

The article on www.kdnuggets.com by Shelvi Garg also shows how to deal with categorical data for machine learning.

Here is a list of 15 types of categorical encoding, covered in the Towards Data Science article linked below (a couple of them are demonstrated in the short sketch after the list):

  • One-hot Encoding
  • Label Encoding
  • Ordinal Encoding
  • Helmert Encoding
  • Binary Encoding
  • Frequency Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
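
If you want to experiment with several of these schemes without writing them by hand, the category_encoders package implements many of them behind a common fit_transform interface. The sketch below is only an illustration under that assumption (the package is a separate pip install, not something the articles above require, and class names may vary across versions):

    import pandas as pd
    import category_encoders as ce  # assumption: pip install category_encoders

    df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})
    target = pd.Series([1, 0, 1, 0, 1])

    # Binary encoding: each level gets an integer ID, written out in base 2,
    # so k levels need only about log2(k) columns instead of k.
    binary = ce.BinaryEncoder(cols=["color"]).fit_transform(df)

    # Leave-one-out encoding: each row receives the target mean of its level
    # computed over all *other* rows, which limits target leakage.
    loo = ce.LeaveOneOutEncoder(cols=["color"]).fit_transform(df, target)

    print(binary)
    print(loo)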

Visit the webpage to understand these categorical encoding methods and get a better idea of the topic.