
Beginner's Guide to Pyjanitor – A Python Tool for Data Cleaning

This article deals with an overview of what pyjanitor is, how it works and a demonstration of using this package to clean dirty data.

As a data scientist, you will spend roughly 60-70% of your time cleaning and preparing your data. The process of cleaning, encoding and transforming raw data into a format that a machine learning model can understand is called data pre-processing. This process is often long and cumbersome, and many developers consider it their least favourite part of a project. Despite being tedious, it is one of the most important steps in any project. To simplify the overall process and make it a bit more interesting, Python offers a package called pyjanitor, a tool for data cleaning.


What is pyjanitor?

Pyjanitor originated as a port of the janitor package from R, re-implemented in Python for convenience. It is an API written on top of the popular Python library pandas. Data pre-processing can be thought of as a directed acyclic graph, where the starting node is raw data and a series of transformations is applied to that raw data to produce usable data. Pandas is a huge part of the data science ecosystem, and the pyjanitor API builds on pandas using a concept called method chaining.

Method chaining works because each operation returns the DataFrame, so the next operation can be called directly on the result. Instead of the imperative style of reassigning the DataFrame after every step, as is common with pandas, method chaining strings multiple operations together in a single expression, with the order of the calls defining the order of the actions taken. Here is an example of method chaining.

If we were using plain pandas to clean our data, it would look something like this:

data = pd.read_csv('your_dataset.csv')
data = data.dropna(subset=['columnA', 'columnB'])
data = data.rename(columns={'columnA': 'apple', 'columnB': 'banana'})
data['new_column'] = ['iterable', 'of', 'items']
data.reset_index(inplace=True, drop=True)

With pyjanitor, the functions are just verbs that help you perform the actions. 

import janitor  # importing janitor registers its methods on pd.DataFrame

data = (
    pd.read_csv('your_dataset.csv')
    .remove_columns(['columnC'])
    .dropna(subset=['columnA', 'columnB'])
    .rename_column('columnA', 'apple')
    .rename_column('columnB', 'banana')
    .add_column('new_column', ['iterable', 'of', 'items'])
    .reset_index(drop=True)
)

Using these functions, we pipeline the steps together. Method chaining helps in writing cleaner code, and the function names are easier to remember, making data cleaning much simpler. There are two advantages to using pyjanitor: one, it extends pandas with convenient data cleaning routines; two, it provides a cleaner, method-chaining, verb-based API for common pandas routines.

How does it work?

The functions are written with the pandas_flavor package and registered as DataFrame methods, without making any changes to pandas itself. You can add your own data processing functions depending on the data and use these methods whenever needed. Here is an example.

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def remove_column(df, column_name: str):
    del df[column_name]
    return df

The decorator registers the function as a DataFrame method as soon as it is defined, and importing the janitor library registers pyjanitor's own built-in methods in the same way. The fact that each registered DataFrame method returns the DataFrame is what gives it the capability to method chain.

Demonstration of data cleaning with pyjanitor

In this demonstration, we will implement various data frame manipulation techniques on raw data. Before we start with cleaning, let us install the requirements. 

pip install pyjanitor

I have used the Melbourne housing dataset, which can be found here. Once you have installed the package and downloaded the data, we can get started with data pre-processing.

Here is what the raw data looks like. 


As you can see, there are missing values, whitespace in the column names and categorical columns. Using pyjanitor, I will now pipeline the most obvious data processing steps into just one block of code.

import janitor

cleaned_df = (
    pd.read_csv('/content/gdrive/MyDrive/melbourne/Melbourne_housing_FULL.csv')
    .clean_names()
    .remove_empty()
    .rename_column('price', 'target')
    .encode_categorical(['method', 'suburb', 'regionname'])
    .fill_empty('buildingarea', value=1)
    .drop('date', axis=1)
)

The first method, clean_names, cleans the column names: it removes the whitespace, capital letters and special characters, making the columns easier to access. Then remove_empty drops any rows and columns that are entirely empty.

To make it easier to identify the features and the target, I renamed the price column to target. Next, I converted three columns, namely 'method', 'suburb' and 'regionname', to categorical values. The fill_empty method fills columns that contain NaN with any value of your choice. Finally, I decided to drop the Date column, just for the purposes of this demonstration.
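The first few of those calls can be approximated in plain pandas. This rough sketch, on a made-up two-column frame, shows roughly what clean_names, remove_empty and fill_empty are doing under the hood:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({" Building Area ": [150.0, np.nan],
                   "Price ($)": [900000, 750000]})

# clean_names(): strip, lowercase, and underscore the column labels (roughly).
df.columns = (
    df.columns.str.strip().str.lower()
    .str.replace(r"[^\w]+", "_", regex=True)
    .str.strip("_")
)

# remove_empty(): drop rows and columns that are entirely NaN.
df = df.dropna(how="all").dropna(axis=1, how="all")

# fill_empty('building_area', value=1): fill NaNs in one column.
df["building_area"] = df["building_area"].fillna(1)

print(list(df.columns))             # ['building_area', 'price']
print(df["building_area"].tolist()) # [150.0, 1.0]
```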


Once you have pipelined the first and obvious parts of processing your data, it becomes easier to identify the more subtle changes needed and where they need to be made. I will now show you how to register your own custom function with janitor.

Let's say you want to remove spaces in a column of your dataset. I have selected the councilarea column. This is what the column looks like.

To remove the space, I will write my custom function as follows and register it to the janitor library.

import pandas_flavor as pf

@pf.register_dataframe_method
def str_remove(df, column_name: str, pat: str, *args, **kwargs):
    df[column_name] = df[column_name].str.replace(pat, "", *args, **kwargs)
    return df

The decorator registers the method as soon as it is defined, so you can use it on your dataset right away.

import janitor

cleaned_df = cleaned_df.str_remove(column_name='councilarea', pat=' ')

Just like that, we wrote a custom function to remove whitespace, and it can be reused on any dataset by anybody. Pyjanitor offers many more methods for making the process of transforming data easier; the documentation provides a full list.

Conclusion

With packages like pyjanitor, the amount of time and effort spent cleaning and transforming data is significantly reduced. Not only does it provide options for pipelining multiple methods, but also the option to create and register your own data cleaning techniques, which can come in handy for other users as well. This is a young yet growing library in the field of data science, helping data scientists perform cumbersome tasks in a much more efficient way.


Bhoomika Madhukar

I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.