
Complete Guide To Handling Categorical Data Using Scikit-Learn

Dealing with categorical features is a common preprocessing step before building machine learning models. There are a variety of techniques to handle categorical data, which I will discuss in this article along with their advantages and disadvantages.


Dealing with categorical features is a common preprocessing step before building machine learning models. In a real-life data science scenario, this means that the dataset has attributes stored as text, such as days of the week (Monday, Tuesday, ..., Sunday), time, colour (Red, Blue, ...), or place names.

Categorical features have a lot to say about the dataset, so they should be converted to numerical form to make them machine-readable. Focusing only on the numerical variables in a dataset isn’t enough to get good accuracy; categorical variables often prove to be the most important factors, so it is worth identifying them for further analysis. Most machine learning algorithms do not support categorical data directly; only a few, such as CatBoost, do.

There are a variety of techniques to handle categorical data, which I will discuss in this article along with their advantages and disadvantages.

Identifying Categorical Variables (Types):

The two major types of categorical features are:

  • Nominal – variables whose categories have no inherent order, such as colour (black, blue, green).
  • Ordinal – variables whose categories follow a natural order, such as student grades (A, B, C, D, Fail). The short pandas sketch below illustrates the difference.
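
The following snippet is a small illustration (it is not part of the original notebook) of how pandas can represent both kinds; an explicit order is meaningful only in the ordinal case.

import pandas as pd

# Nominal: the categories carry no order
colours = pd.Series(pd.Categorical(['black', 'blue', 'green', 'blue']))

# Ordinal: an order is declared, so comparisons and min/max make sense
grades = pd.Series(pd.Categorical(['B', 'A', 'Fail', 'C'],
                                  categories=['Fail', 'D', 'C', 'B', 'A'],
                                  ordered=True))

print(colours.cat.ordered)   # False
print(grades.cat.ordered)    # True
print(grades.min())          # Fail – the lowest grade in the declared order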

The dataset I’m going to work with is the Melbourne Housing Snapshot dataset from Kaggle: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home?select=melb_data.csv

Let’s Explore the Dataset

import pandas as pd
df = pd.read_csv('/content/drive/My Drive/melb_data.csv')
df.head()

The dataset contains 13580 rows and 21 columns.

Let’s pull the categorical columns out of the data and print the list. The object dtype indicates that a column contains text.

s = (df.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)

Categorical variables:
['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'Date', 'CouncilArea', 'Regionname']

For simplicity, I’ve taken up only 3 categorical columns to illustrate encoding techniques. 

features = df[['Type','Method','Regionname']].copy()
features.head()

Handling Categorical Variables

Replace using the map function

This is the most basic approach: manually replace each categorical value with a custom number using a dictionary and the map function.

mapping = {'h':1,
           'u':2,
           't':3
          }
features['type'] = features.Type.map(mapping) 
features.type.value_counts()

OUTPUT

1    9449
2    3017
3    1114
Name: type, dtype: int64

This is not a practical approach when a column has a large number of categories.

Label Encoding

Label encoding assigns each category a unique number from 0 to n-1, which is why it is also termed integer encoding. The LabelEncoder class from the scikit-learn library is used for this purpose.

Before Label Encoding:

features.Regionname.value_counts()

Southern Metropolitan         4695
Northern Metropolitan         3890
Western Metropolitan          2948
Eastern Metropolitan          1471
South-Eastern Metropolitan     450
Eastern Victoria                53
Northern Victoria               41
Western Victoria                32
Name: Regionname, dtype: int64

After Label Encoding:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1 = features[['Regionname']].copy()
df1['Region'] = le.fit_transform(features['Regionname'])
df1.value_counts()
OUTPUT:
Regionname                  Region
Southern Metropolitan       5         4695
Northern Metropolitan       2         3890
Western Metropolitan        6         2948
Eastern Metropolitan        0         1471
South-Eastern Metropolitan  4          450
Eastern Victoria            1           53
Northern Victoria           3           41
Western Victoria            7           32
dtype: int64

fit_transform(y) – fits the label encoder and then returns encoded labels.

LabelEncoder cannot handle missing values, so it’s important to impute them first. Since each category is stored as a single integer, label-encoded data also takes up less disk space. It is simple to use and works well with tree-based algorithms, but it is not well suited to linear models, SVMs, or neural networks, since those models treat the encoded integers as ordered, meaningful quantities.
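
LabelEncoder is really intended for a single (target) column. For integer-encoding several feature columns at once, scikit-learn also provides OrdinalEncoder; below is a minimal sketch, assuming the same features DataFrame as above (the imputation value 'missing' is purely illustrative).

from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
cols = ['Type', 'Method', 'Regionname']
encoded = features[cols].fillna('missing')   # impute first: the encoder does not accept NaNs
encoded[cols] = oe.fit_transform(encoded[cols])
print(oe.categories_)                        # categories learned for each column
encoded.head()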

One Hot Encoding

One hot encoding represents each categorical value as a set of binary columns: every category gets its own column, with a 1 marking the rows where that category occurs. One can also first perform label encoding and then convert the resulting integers into binary indicator columns, which is the most machine-readable form.

Pandas get_dummies() converts categorical variables into dummy/indicator variables.

Before OHE:

features.Method.value_counts()
S     9022
SP    1703
PI    1564
VB    1199
SA      92
Name: Method, dtype: int64

After OHE:

df2 = pd.get_dummies(features['Method'])
df2.value_counts()
OUTPUT:
PI  S  SA  SP  VB
0   1  0   0   0     9022
    0  0   1   0     1703
1   0  0   0   0     1564
0   0  0   0   1     1199
       1   0   0       92
dtype: int64

One hot encoding overcomes the limitations of label encoding and can be used in both tree-based and non-tree-based machine learning algorithms. The disadvantage is that for high cardinality, the feature space can really blow up quickly. The binary variables are often called “dummy variables” in statistics.
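
pandas get_dummies() works well for a one-off transformation. If the same encoding has to be applied to new data later (a test set, for example), scikit-learn’s OneHotEncoder is a better fit because it remembers the categories it was fit on. A minimal sketch, assuming the same features DataFrame (note that the sparse_output argument is named sparse in scikit-learn versions before 1.2):

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
method_ohe = ohe.fit_transform(features[['Method']])
# wrap the result in a DataFrame with readable column names
pd.DataFrame(method_ohe, columns=ohe.get_feature_names_out(['Method'])).head()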

Label Binarizer

Scikit-learn also supports this kind of binary (one-vs-all) encoding through the LabelBinarizer class. We use a similar process as above to transform the data and then create a pandas DataFrame from the result.

from sklearn.preprocessing import LabelBinarizer
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(features["Type"])
pd.DataFrame(lb_results, columns=lb_style.classes_).value_counts()
OUTPUT:
h  t  u
1  0  0    9449
0  0  1    3017
   1  0    1114
dtype: int64

Count/Frequency Encoder

Variables with a multitude of categories are said to have high cardinality. If we one-hot encode a categorical variable with many labels, the feature space expands dramatically. An alternative is to replace each category with its frequency, i.e. the number of times that label appears in the dataset.

Before encoding:

features.Type.iloc[20:23]
OUTPUT:
20    h
21    u
22    t
Name: Type, dtype: object

After encoding:

df_frequency_map = features.Type.value_counts().to_dict()
features.Type = features.Type.map(df_frequency_map)
features.Type.iloc[20:23]
OUTPUT:
20    9449
21    1114
22    3017
Name: Type, dtype: int64

It is very simple to implement and does not expand the feature space. However, if some labels have the same count, they will be replaced with the same number and become indistinguishable, losing valuable information. The technique also assigns somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
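
Because the counts come from the data the map was built on, the same map should be reused when encoding new data. The sketch below illustrates that idea, assuming a simple train/test split of the original DataFrame (the split and the Type_freq column name are illustrative, not from the original notebook).

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()
freq_map = train['Type'].value_counts().to_dict()          # build the map on training data only
train['Type_freq'] = train['Type'].map(freq_map)
test['Type_freq'] = test['Type'].map(freq_map).fillna(0)   # labels unseen in training get 0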

Conclusion

There are many more ways to convert categorical variables to numeric form; I’ve discussed some of the important and commonly used ones. Handling categorical variables is an important step in feature engineering, and new variables derived from categorical ones can give more insight into the dataset.

The complete code for the above implementation is available in AIM’s GitHub repository. Please visit this link to find the notebook with the above code.

Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.