Handling categorical features is a common preprocessing step before building machine learning models. In real-life data science scenarios, it means the dataset has attributes stored as text, such as days of the week (Monday, Tuesday, ..., Sunday), time, colour (Red, Blue, ...), or place names.
Categorical features carry a lot of information about the dataset, so they should be converted to a numerical, machine-readable format. Focusing only on the numerical variables in a dataset is often not enough to get good accuracy; categorical variables frequently turn out to be among the most important factors, so it is worth identifying them for further analysis. Most machine learning algorithms do not support categorical data directly; only a few, such as CatBoost, do.
There are a variety of techniques for handling categorical data, and in this article I will discuss several of them along with their advantages and disadvantages.
Identifying Categorical Variables (Types):
The two major types of categorical features are:
- Nominal – These are variables which are not related to each other in any order such as colour (black, blue, green).
- Ordinal – These are variables where a certain order exists between the categories, such as student grades (A, B, C, D, Fail); a small sketch of mapping such an ordered variable to integers follows below.
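As a quick illustration of the ordinal case, here is a minimal sketch (using hypothetical grade data, not part of the housing dataset used below) that maps the grades to integers in a way that preserves their order:

import pandas as pd

# Hypothetical ordinal data: student grades with a natural order
grades = pd.Series(['A', 'C', 'B', 'Fail', 'D', 'B'])

# Explicit order, worst to best, so the integers preserve the ranking
order = {'Fail': 0, 'D': 1, 'C': 2, 'B': 3, 'A': 4}
print(grades.map(order).tolist())   # [4, 2, 3, 0, 1, 3]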
The dataset I’m going to work with is the Melbourne housing price dataset from Kaggle: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home?select=melb_data.csv
Let’s Explore the Dataset
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/melb_data.csv')
df.head()
The dataset contains 13580 rows and 21 columns.
Let’s extract the categorical columns from the data and print the list. The object dtype indicates that a column contains text.
s = (df.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
Categorical variables:
['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'Date', 'CouncilArea', 'Regionname']
For simplicity, I’ve taken up only 3 categorical columns to illustrate encoding techniques.
features = df[['Type', 'Method', 'Regionname']]
features.head()
Handling Categorical Variables
Replace using the map function
This is a very basic approach: manually replace each categorical value with a custom numeric value using a mapping dictionary.
mapping = {'h': 1, 'u': 2, 't': 3}
features['type'] = features.Type.map(mapping)
features.type.value_counts()
OUTPUT
1    9449
2    3017
3    1114
Name: type, dtype: int64
This is not a practical approach when a variable has a large number of categories.
Label Encoding
Label encoding assigns each distinct category a unique integer from 0 to n-1, which is why it is also called integer encoding. The LabelEncoder class from the scikit-learn library is used for this purpose.
Before Label Encoding:
features.Regionname.value_counts()
Southern Metropolitan         4695
Northern Metropolitan         3890
Western Metropolitan          2948
Eastern Metropolitan          1471
South-Eastern Metropolitan     450
Eastern Victoria                53
Northern Victoria               41
Western Victoria                32
Name: Regionname, dtype: int64
After label Encoding:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df1 = features[['Regionname']]
df1['Region'] = le.fit_transform(features['Regionname'])
df1.value_counts()

OUTPUT:
Regionname                  Region
Southern Metropolitan       5         4695
Northern Metropolitan       2         3890
Western Metropolitan        6         2948
Eastern Metropolitan        0         1471
South-Eastern Metropolitan  4          450
Eastern Victoria            1           53
Northern Victoria           3           41
Western Victoria            7           32
dtype: int64
fit_transform(y) – fits the label encoder and then returns encoded labels.
LabelEncoder cannot handle missing values, so it is important to impute them first. Because the encoded values are simple integers, they also take less memory and disk space than the original strings. Label encoding is simple to use and works well with tree-based algorithms. It is not well suited to linear models, SVMs, or neural networks, because these models treat the encoded integers as ordered, meaningful quantities and their inputs usually need to be standardised.
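As a minimal sketch of the imputation point (the column name and placeholder label are only illustrative; the three columns selected above have no missing values), missing entries can be filled before encoding:

from sklearn.preprocessing import LabelEncoder

# 'CouncilArea' is assumed here to contain missing values;
# fill them with a placeholder label before encoding
council = df['CouncilArea'].fillna('Missing')
le_council = LabelEncoder()
council_encoded = le_council.fit_transform(council)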
One Hot Encoding
One hot encoding represents each categorical value as a binary (0/1) vector, with one column per category. One can also first perform label encoding and then convert those integers into binary vectors, which gives the data a fully machine-readable form.
Pandas get_dummies() converts categorical variables into dummy/indicator variables.
Before OHE:
features.Method.value_counts()

S     9022
SP    1703
PI    1564
VB    1199
SA      92
Name: Method, dtype: int64
After OHE:
df2 = pd.get_dummies(features['Method'])
df2.value_counts()

OUTPUT:
PI  S  SA  SP  VB
0   1  0   0   0     9022
0   0  0   1   0     1703
1   0  0   0   0     1564
0   0  0   0   1     1199
0   0  1   0   0       92
dtype: int64
One hot encoding overcomes the limitations of label encoding and can be used in both tree-based and non-tree-based machine learning algorithms. The disadvantage is that for high cardinality, the feature space can really blow up quickly. The binary variables are often called “dummy variables” in statistics.
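When the encoding has to be applied consistently to training and test data (for example inside a pipeline), scikit-learn’s OneHotEncoder is an alternative to get_dummies; a minimal sketch:

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit as all-zero rows
ohe = OneHotEncoder(handle_unknown='ignore')
method_ohe = ohe.fit_transform(features[['Method']]).toarray()
print(ohe.categories_)   # the categories learned for the 'Method' column
print(method_ohe.shape)  # one binary column per category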
Label Binarizer
Scikit-learn also supports this kind of binary encoding through the LabelBinarizer class. We use a similar process as above to transform the data and then wrap the result in a pandas DataFrame.
from sklearn.preprocessing import LabelBinarizer

lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(features["Type"])
pd.DataFrame(lb_results, columns=lb_style.classes_).value_counts()

OUTPUT:
h  t  u
1  0  0    9449
0  0  1    3017
0  1  0    1114
dtype: int64
Count/Frequency Encoder
Variables with a multitude of categories are said to have high cardinality. If a categorical variable contains many labels, i.e. has high cardinality, one-hot encoding will expand the feature space dramatically. An alternative is to replace each categorical value with its frequency, that is, the number of times the label appears in the dataset.
Before encoding:
20    h
21    u
22    t
Name: Type, dtype: object
After encoding:
df_frequency_map = features.Type.value_counts().to_dict()
features.Type = features.Type.map(df_frequency_map)
features.Type.iloc[20:23]

OUTPUT:
20    9449
21    1114
22    3017
Name: Type, dtype: int64
This is very simple to implement and does not increase the dimensionality of the feature space. However, if two labels happen to have the same count, they will be replaced with the same number and the distinction between them is lost. It also assigns somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
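A closely related variant, sketched below, maps each label to its relative frequency (count divided by the total number of rows) rather than the raw count; the same tie caveat applies. The column name 'Method_freq' is only illustrative:

# Relative-frequency encoding of the 'Method' column
freq_map = (features['Method'].value_counts() / len(features)).to_dict()
features['Method_freq'] = features['Method'].map(freq_map)
features['Method_freq'].head()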
Conclusion
There are many more ways to convert categorical variables to numeric; I have discussed some of the most important and commonly used ones. Handling categorical variables is an important step in feature engineering, and new variables derived from categorical ones can provide additional insight into the dataset.
The complete code of the above implementation is available in AIM’s GitHub repository. Please visit this link to find the notebook with the above code.