Complete Guide To Handling Categorical Data Using Scikit-Learn


Dealing with categorical features is a common preprocessing step before building machine learning models. In real-life data science scenarios, it means that the dataset has attributes stored as text, such as days of the week (Monday, Tuesday, …, Sunday), time, colour (Red, Blue, …), or place names.

Categorical features have a lot to say about a dataset, so they should be converted to numerical form to make them machine-readable. Focusing only on the numerical variables in a dataset is not enough to get good accuracy; categorical variables often prove to be the most important factors, so it is worth identifying them for further analysis. Most machine learning algorithms do not support categorical data directly; only a few, such as CatBoost, do.

There are a variety of techniques for handling categorical data, which I will discuss in this article along with their advantages and disadvantages.


Identifying Categorical Variables (Types):

The two major types of categorical features are:

  • Nominal – These are variables which are not related to each other in any order such as colour (black, blue, green).
  • Ordinal – These are variables where a certain order can be found between them, such as student grades (A, B, C, D, Fail); a short encoding sketch follows below.
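For ordinal variables, scikit-learn’s OrdinalEncoder can encode the categories in an explicit order. A minimal sketch, using the grades example (the grade data below is made up for illustration and is not part of the housing dataset):

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Hypothetical grade data, for illustration only
grades = np.array([['A'], ['C'], ['B'], ['Fail'], ['D']])

# Passing the categories explicitly fixes the order: A=0, B=1, C=2, D=3, Fail=4
enc = OrdinalEncoder(categories=[['A', 'B', 'C', 'D', 'Fail']])
print(enc.fit_transform(grades).ravel())  # [0. 2. 1. 4. 3.]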

The dataset I’m going to work with is the Melbourne Housing Price dataset from Kaggle.


Let’s Explore the Dataset

import pandas as pd
df = pd.read_csv('/content/drive/My Drive/melb_data.csv')

The dataset contains 13580 rows and 21 columns.

Let’s get the categorical columns out of the data and print the list. The object dtype indicates that a column contains text.

s = (df.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)

Categorical variables:
['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'Date', 'CouncilArea', 'Regionname']

For simplicity, I’ve taken up only 3 categorical columns to illustrate encoding techniques. 

# copy to avoid SettingWithCopyWarning when adding columns later
features = df[['Type','Method','Regionname']].copy()

Handling Categorical Variables

Replace using the map function

This could be a very basic approach to manually replace categorical values with custom values.

mapping = {'h': 1, 'u': 2, 't': 3}
features['type'] = features['Type'].map(mapping)
features['type'].value_counts()

1    9449
2    3017
3    1114
Name: type, dtype: int64

This is not a good approach when a variable has a large number of categories.

Label Encoding

Label encoding numbers the different categories uniquely from 0 to n-1, which is why it is also termed integer encoding. The LabelEncoder class from the scikit-learn library is used for this purpose.

Before Label Encoding:


Southern Metropolitan         4695
Northern Metropolitan         3890
Western Metropolitan          2948
Eastern Metropolitan          1471
South-Eastern Metropolitan     450
Eastern Victoria                53
Northern Victoria               41
Western Victoria                32
Name: Regionname, dtype: int64

After Label Encoding:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1 = features[['Regionname']].copy()
df1['Region'] = le.fit_transform(features['Regionname'])
df1.value_counts()
Regionname                  Region
Southern Metropolitan       5         4695
Northern Metropolitan       2         3890
Western Metropolitan        6         2948
Eastern Metropolitan        0         1471
South-Eastern Metropolitan  4          450
Eastern Victoria            1           53
Northern Victoria           3           41
Western Victoria            7           32
dtype: int64

fit_transform(y) – fits the label encoder and then returns encoded labels.

LabelEncoder cannot handle missing values, so it’s important to impute them first. Label encoding stores values using less space and is simple to use, and it works well with tree-based algorithms. It is not suitable for linear models, SVMs, or neural networks, because those models would interpret the arbitrary integer codes as magnitudes (and their inputs generally need to be standardized).
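As a minimal sketch of that imputation step (CouncilArea is used here because it contains missing values in this dataset; any categorical column with NaNs would do):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Fill missing CouncilArea values with the most frequent category
imputer = SimpleImputer(strategy='most_frequent')
council = imputer.fit_transform(df[['CouncilArea']]).ravel()

# The column can now be label encoded without errors
council_encoded = LabelEncoder().fit_transform(council)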

One Hot Encoding

One hot encoding is a binary encoding applied to categorical values: each category becomes its own column, marked 1 when present and 0 otherwise. One can also first label-encode the categories and then convert those integers into binary indicator variables, which yields the most machine-readable form.

Pandas get_dummies() converts categorical variables into dummy/indicator variables.

Before OHE:

S     9022
SP    1703
PI    1564
VB    1199
SA      92
Name: Method, dtype: int64

After OHE:

df2 = pd.get_dummies(features['Method'])
df2.value_counts()

PI  S  SA  SP  VB
0   1  0   0   0     9022
0   0  0   1   0     1703
1   0  0   0   0     1564
0   0  0   0   1     1199
0   0  1   0   0       92
dtype: int64

One hot encoding overcomes the limitations of label encoding and can be used in both tree-based and non-tree-based machine learning algorithms. The disadvantage is that for high cardinality, the feature space can really blow up quickly. The binary variables are often called “dummy variables” in statistics.
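Scikit-learn’s OneHotEncoder produces the same transformation with a bit more control; a minimal sketch (note: the sparse_output argument was named sparse in scikit-learn versions before 1.2):

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit as all-zero rows
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
method_ohe = ohe.fit_transform(features[['Method']])
print(ohe.get_feature_names_out())
# ['Method_PI' 'Method_S' 'Method_SA' 'Method_SP' 'Method_VB']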

Label Binarizer

Scikit-learn also supports binary encoding via the LabelBinarizer. We use a process similar to the one above to transform the data, then create a pandas DataFrame from the result.

from sklearn.preprocessing import LabelBinarizer
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(features["Type"])
pd.DataFrame(lb_results, columns=lb_style.classes_).value_counts()
h  t  u
1  0  0    9449
0  0  1    3017
0  1  0    1114
dtype: int64
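A convenient property of LabelBinarizer is that the transformation is reversible:

# Recover the original Type labels from the binary matrix
original_labels = lb_style.inverse_transform(lb_results)
print(original_labels[:5])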

Count/Frequency Encoder

Variables with a multitude of categories are also referred to as variables with high cardinality. If we one-hot encode categorical variables with many labels, we expand the feature space dramatically. An alternative is to replace each category with its frequency, i.e. the number of times the label appears in the dataset.

Before encoding:

20    h
21    u
22    t
Name: Type, dtype: object

After encoding:

df_frequency_map = features.Type.value_counts().to_dict()
features.Type = features.Type.map(df_frequency_map)

20    9449
21    3017
22    1114
Name: Type, dtype: int64

It is very simple to implement and does not increase the feature space. However, if some labels have the same count, they will be replaced with the same value, losing information that distinguished them. It also assigns somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
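One more practical point: to avoid leakage, the frequency map should be learned on the training split only and then applied to the test data. A minimal sketch (the split itself is illustrative):

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

# Learn frequencies on the training split only
freq_map = train['Type'].value_counts().to_dict()

# Apply to both splits; categories unseen in training become 0 in the test set
train['Type_freq'] = train['Type'].map(freq_map)
test['Type_freq'] = test['Type'].map(freq_map).fillna(0)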


There are many more ways by which categorical variables can be converted to numeric; I’ve discussed some of the most important and commonly used ones. Handling categorical variables is an important step in feature engineering, and new features derived from categorical variables can give more insight into the dataset.

The complete code of the above implementation is available in AIM’s GitHub repository. Please visit this link to find the notebook with the above code.


Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology, for fun and for worthwhile purposes.
