Tutorial On Missingno – Python Tool To Visualize Missing Values

The purpose of this article is to get a better understanding of missing data by visualizing them using Missingno.

Anyone working in data science understands the importance of data: it is the resource that fuels a machine learning model. But raw real-world data cannot be used until it has been pre-processed into a usable format. One of the most common problems with real-world data is missing values, where some entries in the rows and columns simply do not exist. For good model training, we need the data to be as clean as possible.

Missing values are generally represented in pandas as NaN, which stands for Not a Number. Although the pandas library provides methods to impute or drop these missing values, we first need to understand how many NaN values there are and where they are distributed in the dataset. For this, we can use a third-party Python library called Missingno.

What is Missingno?

Missingno is a Python library that helps you understand the distribution of missing values through informative visualizations, such as matrix plots, heatmaps and bar charts. With this library, it is possible to observe where the missing values occur and to check how the missingness of one column correlates with the missingness of another. Missing values are better handled once the dataset is fully explored. Let us now implement this and find out how it helps us pre-process the data better.

Implementation of Missingno

The first step in implementing this is to install the library using the pip command as follows:

pip install missingno

Once this is installed, let us select a dataset that contains missing values. I have selected the Life Expectancy dataset from Kaggle. This dataset is used to estimate average human life expectancy based on geographical location, health expenditure, disease and so on. The dataset can be downloaded from Kaggle.

Loading the dataset

Let us now import some of the libraries and load our dataset. 

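# Mount Google Drive so the CSV file can be read from Colab
from google.colab import drive
drive.mount('/content/gdrive')

import numpy as np
import pandas as pd

# Load the dataset and preview the first five rows
life_expectancy = pd.read_csv("/content/gdrive/My Drive/Life Expectancy Data.csv")
life_expectancy.head()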

Now, let us count the missing values in each column using the isna method of pandas.

life_expectancy.isna().sum()

We can see that several columns have missing values. It is now time to visualize them using the library.

Visualization of missing values

  1. Matrix

import missingno as msno

msno.matrix(life_expectancy)

The matrix covers all 2938 rows of the dataset, numbered from 1 to 2938 on the vertical axis. The white lines indicate the missing values in each column. The Hepatitis B, population and GDP columns seem to have the highest number of missing values. On closer observation, you can also notice a few patterns in the missing rows and columns. For example, rows that are missing a value in the BMI column are also missing a value in the thinness 1-19 years column. Similarly, where values are missing from the GDP column, the income column is also missing those rows. These patterns give an idea of how the missingness of one feature is related to that of another. But to get a better idea of these correlations we need to use a heatmap.
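The kind of co-occurrence read off the matrix can also be checked numerically. The following is a minimal sketch using plain pandas on a small made-up DataFrame; the helper name co_missing and the toy values are assumptions for illustration, not part of Missingno or of the Life Expectancy dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) standing in for a few columns of the dataset.
toy = pd.DataFrame({
    "BMI": [22.1, np.nan, 19.5, np.nan, 24.0],
    "thinness": [3.2, np.nan, 4.1, np.nan, 2.8],
    "GDP": [np.nan, 900.0, 1100.0, np.nan, 1500.0],
})

def co_missing(df, col_a, col_b):
    """Hypothetical helper: count rows where both columns are missing,
    and rows where exactly one of the two is missing."""
    a_missing = df[col_a].isna()
    b_missing = df[col_b].isna()
    both = int((a_missing & b_missing).sum())
    only_one = int((a_missing ^ b_missing).sum())
    return both, only_one

print(co_missing(toy, "BMI", "thinness"))  # missing on exactly the same rows
print(co_missing(toy, "BMI", "GDP"))       # only partial overlap
```

If the second number is zero, the two columns are missing on exactly the same rows, which is the pattern described for BMI and thinness above.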

  2. Heatmap

msno.heatmap(life_expectancy)

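The heatmap shows nullity correlation: how strongly the presence or absence of a value in one column is related to the presence or absence of a value in another. Positive correlation is shown in blue, and the darker the shade, the stronger the correlation. The map shows that total expenditure and alcohol have the highest correlation, 0.9, meaning these two columns tend to be missing in the same rows. It also shows that the GDP and income columns are positively correlated, in line with our initial intuition from the matrix. Note that this describes the missingness pattern only; it does not by itself tell us how the columns affect the target.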
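To make the idea of nullity correlation concrete, here is a minimal sketch that computes it directly with pandas on made-up data. It assumes the heatmap values are ordinary correlations between the per-column missing/not-missing indicators, with fully complete columns left out; the toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); "Year" has no missing values at all.
toy = pd.DataFrame({
    "Alcohol": [1.2, np.nan, 3.4, np.nan, 0.5, 2.2],
    "Total expenditure": [5.0, np.nan, 6.1, np.nan, 4.4, np.nan],
    "GDP": [np.nan, 900.0, 1100.0, 1500.0, np.nan, 700.0],
    "Year": [2010, 2011, 2012, 2013, 2014, 2015],
})

# 1 where a value is missing, 0 where it is present.
is_missing = toy.isna().astype(int)

# Columns that are fully present (or fully missing) have no variation,
# so a correlation is undefined for them; leave them out.
has_variation = is_missing.nunique() > 1

# Pearson correlation between the missingness indicators.
nullity_corr = is_missing.loc[:, has_variation].corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing together, while a negative value means that when one is missing the other tends to be present.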

Another way to visualize the data for missing values is by using bar plots. 

  3. Bar Plot

msno.bar(life_expectancy)

The height of each bar is proportional to the amount of non-missing data in that column, and the count of non-missing values is shown above each bar. Since the total number of data points is 2938, any column with a count lower than this contains missing values.
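The quantities behind such a bar chart can be reproduced with pandas alone. The sketch below uses a small made-up DataFrame (the values are assumptions, not taken from the dataset) to compute the non-missing count per column, the corresponding proportion, and the list of columns that fall short of the total row count.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with 4 rows in total.
toy = pd.DataFrame({
    "Hepatitis B": [np.nan, 92.0, np.nan, 88.0],
    "GDP": [1200.0, np.nan, 850.0, 990.0],
    "Year": [2012, 2013, 2014, 2015],
})

total_rows = len(toy)
non_missing = toy.notna().sum()          # count of present values per column
completeness = non_missing / total_rows  # proportion, between 0 and 1

# Columns whose count is below the total row count contain missing values.
incomplete_columns = non_missing[non_missing < total_rows].index.tolist()
print(incomplete_columns)
```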

Conclusion

In this article, we saw how to visualize missing data graphically and understand the relationships in missingness among the different columns. Missingno helps in understanding the structure of the dataset with very few lines of code. With this information, it becomes easier and more efficient to use pandas either to impute the values or to drop them, which can in turn improve the overall accuracy of the model.
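As a rough illustration of the two options mentioned above, the sketch below drops incomplete rows and, alternatively, fills gaps with each column's median. The toy data is made up, and the median is only one possible imputation choice; which strategy suits a real dataset depends on the patterns found during exploration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one gap in each column.
toy = pd.DataFrame({
    "GDP": [1200.0, np.nan, 850.0, 990.0],
    "Alcohol": [1.5, 2.0, np.nan, 0.5],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute each gap with the median of its column.
imputed = toy.fillna(toy.median(numeric_only=True))

print(len(dropped))
print(imputed)
```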


Bhoomika Madhukar

I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.