A Complete Guide to Categorical Data Encoding

In data science, data preparation is a mandatory step before modelling, and encoding categorical data is one of its crucial tasks. Most real-world data comes with categorical string values, while most machine learning models work only with numeric values: at their core, models perform mathematical operations, and mathematics depends on numbers. In short, most models require the data to be numeric, whether integer or float, rather than strings.

Encoding categorical data is the process of converting categorical values into numeric format so that the converted data can be fed to models to produce and improve predictions. In this article, we will discuss categorical data encoding and why it is needed. The following are the important points that we will cover.


Table of contents

  1. What is Categorical Data?
  2. Label Encoding or Ordinal Encoding
  3. One-Hot Encoding
  4. Effect Encoding
  5. Hash Encoding
  6. Binary Encoding
  7. Base-N Encoding
  8. Target Encoding

What is Categorical Data? 

As the whole discussion in this article is based on working with categorical data, we should begin by understanding it. Data consisting of a finite set of possible values can be considered categorical data.

Categorical data can be considered as gathered information that is divided into groups. For example, in a list of many people with their blood groups: A+, A-, B+, B-, AB+, AB-, O+, O-, each of the blood types is a categorical value.

There can be two kinds of categorical data:

  • Nominal data
  • Ordinal data

Nominal data: This type of categorical data consists of name variables without any numerical order. For example, the names of the different departments in an organization, such as the research and development department, human resource department, and accounts and billing department.


Ordinal data: This type of categorical data consists of values with an order or scale. For example, a list of patients with the level of sugar present in the body, which can be divided into high, medium and low classes.

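This distinction can be made explicit in pandas, which supports both unordered (nominal) and ordered (ordinal) categorical dtypes. A minimal sketch, with illustrative example values:

```python
import pandas as pd

# Nominal: department names have no inherent order
dept = pd.Series(['HR', 'R&D', 'Accounts'], dtype='category')

# Ordinal: sugar levels carry a low < medium < high ordering
sugar = pd.Series(['high', 'low', 'medium'],
                  dtype=pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True))

print(dept.cat.ordered)   # False
print(sugar.cat.ordered)  # True
print(sugar.min())        # 'low' - comparisons respect the declared order
```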

Label Encoding or Ordinal Encoding

This type of encoding is used when the variables in the data are ordinal. Ordinal encoding converts each label into an integer value, and the encoded data represents the sequence of labels.

The way to achieve this in Python is illustrated below.


import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'height': ['tall', 'medium', 'short', 'tall', 'medium', 'short', 'tall', 'medium', 'short']})

# Create an OrdinalEncoder object with an explicit label-to-integer mapping
encoder = ce.OrdinalEncoder(cols=['height'], return_df=True,
                            mapping=[{'col': 'height',
                                      'mapping': {None: 0, 'tall': 1, 'medium': 2, 'short': 3}}])

# Original data
print(df)

# Fit and transform the data
df['transformed'] = encoder.fit_transform(df)['height']
print(df)

Output:
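For reference, the same mapping can be reproduced with plain pandas, which makes explicit what the encoder is doing (a sketch, assuming the mapping used above):

```python
import pandas as pd

df = pd.DataFrame({'height': ['tall', 'medium', 'short']})

# The same label-to-integer mapping passed to OrdinalEncoder above
order = {'tall': 1, 'medium': 2, 'short': 3}
df['transformed'] = df['height'].map(order)
print(df)
```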

One-Hot Encoding 

In One-Hot Encoding, each category of a categorical variable gets its own new variable (column). It maps each category to binary values (0 or 1). This type of encoding is used when the data is nominal. The newly created binary features can be considered dummy variables, and their number depends on the number of categories present in the data.

The way to achieve this in Python is illustrated below.

df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit', 'aditya', 'yash', 'vipin', 'amit']})

# Create an object for One-Hot Encoding
encoder = ce.OneHotEncoder(cols='name', handle_unknown='return_nan', return_df=True, use_cat_names=True)

#Original Data
print(df)

#Fit and transform Data
df_encoded = encoder.fit_transform(df)
print(df_encoded)

Output:

Here in the above output, we can see a dummy variable for every category.
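pandas also offers this transformation through its built-in get_dummies helper; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit']})

# One new column per category; 1 marks the row's category, 0 elsewhere
dummies = pd.get_dummies(df['name'], prefix='name')
print(dummies)
```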

Effect Encoding

In this type of encoding, the encoder assigns values to the categories in the -1, 0, 1 format. The presence of -1 is the only difference between One-Hot Encoding and effect encoding (also known as sum or deviation encoding).

After implementing this, we can understand it properly.

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Create an object for effect (sum) encoding
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

# Original data
print(data)

# Fit and transform the data
df = encoder.fit_transform(data)
print(df)

Output:

Here in the above output, we can see that the encoder has assigned -1 to Bangalore in every dummy variable. This is how effect encoding generates dummy variables consisting of -1, 0 and 1 as encoded categories.

Hash Encoding

Just like One-Hot Encoding, the hash encoder converts categories into binary numbers using new data variables, but here we can fix the number of new variables. Before going into the implementation, we should know about hashing: hashing is the transformation of an input of arbitrary size into a fixed-size value.
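The fixed-size property is easy to see with Python's built-in hashlib (a generic illustration of hashing, not the encoder's exact internals):

```python
import hashlib

for text in ['June', 'a much longer category name']:
    # md5 maps input of any length to a fixed 128-bit (32 hex character) digest
    digest = hashlib.md5(text.encode()).hexdigest()
    print(text, '->', digest, len(digest))
```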

Implementation of Hash Encoding in Python

data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September']})

# Create an object for hash encoding with 6 output components
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and transform the data
encoder.fit_transform(data)

Output:

Unlike the other encoders, hashing is a one-way technique: the hash encoder's output cannot be converted back into the input, so it may cause loss of information from the data. It is best applied to data with a high number of categorical values.

In the above example, we implemented the hashing encoder with 6 dummy variables instead of the default 8, just by setting n_components=6.

Binary Encoding

In hash encoding, we have seen that hashing can cause loss of information, and on the other hand, in One-Hot Encoding the dimensionality of the data increases. Binary encoding is a process that keeps the dimensionality low, like hash encoding, without losing information, like One-Hot Encoding.

Basically, we can say that binary encoding combines the strengths of hash and One-Hot Encoding.

After implementing it, we can see the basic difference between binary, hash and One-Hot Encoding.

# Create an object for binary encoding
encoder = ce.BinaryEncoder(cols=['Month'], return_df=True)

# Fit and transform the data (a new name keeps the original frame intact)
data_binary = encoder.fit_transform(data)
print(data_binary)

Output:


Here in the output, we can see that without losing much information, we have encoded data with lower dimensionality than One-Hot Encoding. This encoding is very helpful for data with a huge number of categories.
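Conceptually, binary encoding first assigns each category an ordinal code and then writes that code in base 2, one column per bit. A hand-rolled sketch of the idea (not the library's exact implementation):

```python
months = ['January', 'April', 'March', 'February', 'June', 'July', 'September']

# Step 1: ordinal codes starting from 1
codes = {m: i + 1 for i, m in enumerate(months)}

# Step 2: number of bits needed for the largest code
n_bits = max(codes.values()).bit_length()

# Step 3: each category becomes the bits of its code
for month, code in codes.items():
    print(month, code, format(code, f'0{n_bits}b'))
```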

Base-N Encoding

In a positional number system, the base or radix is the number of unique digits, including zero, used to represent numbers. In Base-N encoding, if the base is two, the encoder converts categories into numeric form using their respective binary representations, which is exactly binary encoding (setting the base to one gives One-Hot Encoding). But if we change the base to a higher value such as 10, each encoded digit varies between 0 and 9. By implementation, we can understand it more.



data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September']})

# Create an object for Base-N encoding with base 5
encoder = ce.BaseNEncoder(cols=['Month'], return_df=True, base=5)

# Fit and transform the data
data_encoded = encoder.fit_transform(data)
print(data_encoded)

Output:

In the above output, we can see that we have used base 5. It is quite similar to binary encoding, but where binary encoding gave us 4 dimensions after conversion, here we have only 3 dimensions, and the numbers vary between 0 and 4.

If we do not define the base, by default it is set to 2, which basically performs binary encoding.
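The number of output columns follows from simple positional arithmetic: each extra digit multiplies the number of representable codes by the base. A small sketch (actual column counts from category_encoders can differ slightly, depending on how it reserves codes for missing or unknown values):

```python
def n_digits(n_codes, base):
    """Digits needed to represent codes 0 .. n_codes - 1 in the given base."""
    d = 1
    while base ** d < n_codes:
        d += 1
    return d

print(n_digits(8, 2))  # 3 binary digits cover 8 codes
print(n_digits(8, 5))  # 2 base-5 digits cover up to 25 codes
```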

Target Encoding

Target encoding is the method of converting a categorical value into the mean of the target variable. It is a type of Bayesian encoding method, in which the encoder uses the target variable to encode the categorical value.

The target encoder calculates the mean of the target variable for each category and replaces the category with that mean. Let's see how it works.



df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit', 'rahul', 'ashok', 'ankit'],
                   'marks': [10, 20, 30, 60, 70, 80]})

print(df)

Output:


#Create target encoding object
encoder=ce.TargetEncoder(cols='name') 

#Fit and Transform Train Data
encoder.fit_transform(df['name'],df['marks'])

Output:

Here we can see that the names of students are replaced by the mean of their marks. This is a good method of encoding, as we can encode any number of categories with it. But it can cause overfitting, because using the target mean as the encoded value creates a strong, leaky correlation between the feature and the target.
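The per-category means behind this output can be checked directly with a groupby. Note that category_encoders additionally smooths each estimate toward the global mean, so its values need not match these raw means exactly:

```python
import pandas as pd

df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit', 'rahul', 'ashok', 'ankit'],
                   'marks': [10, 20, 30, 60, 70, 80]})

# Raw mean of the target per category
print(df.groupby('name')['marks'].mean())
```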

Using this, the model may fit the training data well but fail or lose accuracy on the test data.

Final words

In summary, encoding is a crucial part of machine learning. In real-world problems, we usually need to choose an appropriate encoding method for the model to work properly, and different encoders can change the results of the model. In this article, we have seen various encoding methods and how to implement them in Python using the category_encoders library.



Copyright Analytics India Magazine Pvt Ltd
