In data science, data preparation is a mandatory step before modelling, and encoding categorical data is one of its crucial tasks. Most real-world data comes with categorical string values, while most machine learning models work only with numerical values (or other representations the model can understand). Models ultimately perform mathematical operations, and mathematics depends on numbers. In short, most models require the data to be numbers, whether float or integer, rather than strings.

Encoding categorical data is the process of converting categorical values into a numeric format so that the converted data can be fed to models to produce and improve predictions. In this article, we will discuss categorical data encoding and try to understand why we need it. The following are the important points that we will discuss in this article.


**Table of contents**

- What is Categorical Data?
- Label Encoding or Ordinal Encoding
- One-Hot Encoding
- Effect Encoding
- Hash Encoding
- Binary Encoding
- Base-N Encoding
- Target Encoding

**What is Categorical Data?**

As the whole discussion in this article is based on working with categorical data, we should begin by understanding it. Data that takes on a finite set of possible values can be considered categorical data.

Categorical data can be considered gathered information that is divided into groups. For example, a list of people with their blood groups: A+, A-, B+, B-, AB+, AB-, O+, O-, in which each blood type is a categorical value.

There can be two kinds of categorical data:

- Nominal data
- Ordinal data

**Nominal data:** This type of categorical data consists of named values without any numerical order. For example, the names of the different departments in an organization: research and development, human resources, accounts and billing, etc.


**Ordinal data:** This type of categorical data carries an order or scale. For example, a list of patients' blood-sugar levels divided into high, medium and low classes.

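To make the distinction concrete, here is a small sketch using pandas (the department and sugar-level values are illustrative):

```
import pandas as pd

# Nominal: categories have no inherent order
dept = pd.Categorical(['HR', 'R&D', 'Accounts', 'HR'], ordered=False)

# Ordinal: categories carry an explicit order
sugar = pd.Categorical(['high', 'low', 'medium', 'low'],
                       categories=['low', 'medium', 'high'],
                       ordered=True)

print(dept.ordered)   # False
print(sugar.ordered)  # True
print(sugar.min())    # low - comparisons make sense only for ordered data
```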

**Label Encoding or Ordinal Encoding**

This type of encoding is used when the variables in the data are ordinal. Ordinal encoding converts each label into an integer value, and the encoded data represents the sequence of the labels.

The way to achieve this in Python is illustrated below.

```
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'height': ['tall', 'medium', 'short', 'tall', 'medium',
                              'short', 'tall', 'medium', 'short']})

# Create object of OrdinalEncoder with an explicit mapping
encoder = ce.OrdinalEncoder(cols=['height'], return_df=True,
                            mapping=[{'col': 'height',
                                      'mapping': {'None': 0, 'tall': 1,
                                                  'medium': 2, 'short': 3}}])

# Original data
print(df)

# Fit and transform data
df['transformed'] = encoder.fit_transform(df)
print(df)
```

Output:

**One-Hot Encoding**

In one-hot encoding, each category of a categorical variable gets a new variable, mapped to a binary value (0 or 1). This type of encoding is used when the data is nominal. The newly created binary features can be considered dummy variables; after one-hot encoding, the number of dummy variables depends on the number of categories present in the data.

The way to achieve this in Python is illustrated below.

```
df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit', 'aditya',
                            'yash', 'vipin', 'amit']})

# Create object of OneHotEncoder
encoder = ce.OneHotEncoder(cols='name', handle_unknown='return_nan',
                           return_df=True, use_cat_names=True)

# Original data
print(df)

# Fit and transform data
df_encoded = encoder.fit_transform(df)
print(df_encoded)
```

Output:

Here in the above output, we can see dummy variables for every category.

**Effect Encoding**

In this type of encoding, the encoder assigns values to the categories in a -1, 0, 1 format; the use of -1 is the only difference between one-hot encoding and effect encoding.

After implementing this, we can understand it properly.

```
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})

# Create object of SumEncoder (effect coding)
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

# Original data
print(data)

# Fit and transform data
df = encoder.fit_transform(data)
print(df)
```

Output:

Here in the above output, we can see that the encoder has assigned -1 to Bangalore in every dummy variable. This is how effect encoding generates dummy variables consisting of -1, 0 and 1 as encoded categories.

**Hash Encoding**

Just like one-hot encoding, the hash encoder converts categories into binary numbers using new data variables, but here we can fix the number of new variables. Before going to the implementation, we should know about hashing: hashing is the transformation of an arbitrary-size input into a fixed-size value.

Implementation of Hash Encoding in Python

```
data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February',
                               'June', 'July', 'June', 'September']})

# Create object of HashingEncoder with 6 output components
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and transform data
print(encoder.fit_transform(data))
```

Output:

Unlike the other encoders, hashing is a one-way technique: the hash encoder's output cannot be converted back into the input, which may cause a loss of information from the data. It is best applied to data with a high number of categorical values.

In the above example, we implemented the hashing encoder with 6 dummy variables instead of the default 8, just by using n_components=6.
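The one-way, fixed-size idea behind the hash encoder can be sketched with Python's standard hashlib (a simplification, not category_encoders' exact hashing scheme):

```
import hashlib

def hash_row(value, n_components=6):
    """Map an arbitrary string to one of n_components columns."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    index = int(digest, 16) % n_components
    row = [0] * n_components   # one encoded row: zeros except the hashed column
    row[index] = 1
    return row

for month in ['January', 'April', 'March']:
    print(month, hash_row(month))
```

The same input always lands in the same column, but the mapping cannot be inverted, and two different categories may collide in the same column, which is the information loss mentioned above.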

**Binary Encoding**

In hash encoding we have seen that hashing can cause a loss of information, while in one-hot encoding the dimensionality of the data increases. Binary encoding is a process that reduces dimensionality like hash encoding while preserving the information like one-hot encoding.

Basically, we can say that binary encoding combines ideas from hash and one-hot encoding.

After implementation, we can see the basic difference between binary encoding and the hash and one-hot encodings.

```
# Create object of BinaryEncoder
encoder = ce.BinaryEncoder(cols=['Month'], return_df=True)

# Fit and transform data (keep the original frame intact for later examples)
data_encoded = encoder.fit_transform(data)
print(data_encoded)
```

Output:

Here in the output, we can see that without losing much information we have got encoded data with reduced dimensionality than the One-Hot encoding. This encoding is very helpful in the case of data with a huge amount of categories.
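The dimensionality reduction can be sketched by hand: assign each category an ordinal index, then spread the index's binary digits across columns (a simplified view of what BinaryEncoder produces; the library's exact indexing may differ):

```
months = ['January', 'April', 'March', 'February', 'June', 'July', 'September']

# Step 1: a 1-based ordinal index per unique category
index = {m: i + 1 for i, m in enumerate(dict.fromkeys(months))}

# Step 2: the binary digits of the index become the encoded columns
n_bits = max(index.values()).bit_length()   # 7 categories -> 3 columns, not 7
for m in months[:3]:
    bits = [(index[m] >> b) & 1 for b in reversed(range(n_bits))]
    print(m, bits)
```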

**Base-N Encoding**

In a positional number system, the base or radix is the number of unique digits, including zero, used to represent numbers. In base-N encoding, if the base is two, the encoder converts categories into numerical form using their respective binary representations, which is effectively binary encoding. If we change the base to 10, the categories get converted into numeric digits between 0 and 9. By implementation, we can understand it more.

```
# Create object of BaseNEncoder with base 5
encoder = ce.BaseNEncoder(cols=['Month'], return_df=True, base=5)

# Fit and transform data
data_encoded = encoder.fit_transform(data)
print(data_encoded)
```

Output:

In the above output, we can see that we have used base 5. It is pretty similar to binary encoding, but where binary encoding gave us 4 dimensions after conversion, here we have only 3 dimensions, and the numbers vary between 0 and 4.

If we do not define the base, by default it is set to 2, which basically performs binary encoding.
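The underlying arithmetic is ordinary base conversion. Here is a small sketch of how an ordinal index turns into base-N digit columns (a simplified view; BaseNEncoder's exact indexing and padding may differ):

```
def to_base_n(index, base, width):
    """Digits of `index` in the given base, padded to `width` columns."""
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]

# The same index needs fewer columns as the base grows
print(to_base_n(7, 2, 4))  # [0, 1, 1, 1] - base 2 (binary encoding)
print(to_base_n(7, 5, 2))  # [1, 2]       - base 5
```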

**Target Encoding**

Target encoding is a method of converting a categorical value into the mean of the target variable. It is a type of Bayesian encoding, a family of methods that use the target variable to encode categorical values.

The target encoder calculates the mean of the target variable for each category and replaces the categories with their means. Let's see how it works.

```
df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit',
                            'rahul', 'ashok', 'ankit'],
                   'marks': [10, 20, 30, 60, 70, 80]})
print(df)
```

Output:

Applying target encoding:

```
# Create target encoding object
encoder = ce.TargetEncoder(cols='name')

# Fit and transform train data
print(encoder.fit_transform(df['name'], df['marks']))
```

Output:

Here we can see that the names of the students have been replaced with the mean of their marks. This is a convenient method of encoding, since it can handle any number of categories, but it can cause overfitting: because we use the target mean as the category value, it creates a strong correlation between the encoded feature and the target.

Using this we can train the model, but at test time it can lead to failure or inaccuracy of the model.
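The replacement values can be checked by hand with a plain groupby mean (note that category_encoders applies smoothing toward the global mean by default, so its output can differ slightly from these raw per-category means):

```
import pandas as pd

df = pd.DataFrame({'name': ['rahul', 'ashok', 'ankit',
                            'rahul', 'ashok', 'ankit'],
                   'marks': [10, 20, 30, 60, 70, 80]})

# Raw mean of the target per category
means = df.groupby('name')['marks'].mean()
print(means)   # ankit 55.0, ashok 45.0, rahul 35.0
```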

**Final words**

In summary, encoding is a crucial part of machine learning. In real-world problems, we often need to choose an encoding method for the model to work properly, and working with different encoders can change the results of the model. In this article, we have seen various encoding methods and how we can implement them using Python and the category_encoders library.



Yugesh is a graduate in automobile engineering and has worked as a data analyst intern. He has completed several data science projects and has a strong interest in deep learning and writing blogs on data science and machine learning.