Tutorial On Missingno – Python Tool To Visualize Missing Values

The purpose of this article is to build a better understanding of missing data by visualizing it with Missingno.

People working in data science understand the importance of data: it is the resource that fuels a machine learning model. Raw real-world data, however, cannot be used until it has been pre-processed into a usable format. One of the most common problems with real-world data is missing values, where entries in some rows and columns simply do not exist. To train a good model, the data needs to be as clean as possible.

Missing values are generally represented as NaN, which stands for Not a Number. Although the pandas library provides methods to impute these missing values, we first need to understand how many NaN values there are and where they are distributed in the dataset. Missingno, a third-party Python library, was created for this purpose.
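
To make the NaN representation concrete, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and are not taken from the dataset used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap written as np.nan, one as None
toy = pd.DataFrame({
    "gdp": [1200.5, np.nan, 845.0],
    "status": ["Developing", None, "Developed"],
})

# NaN never compares equal to itself, so == cannot be used to find it
nan_equals_itself = np.nan == np.nan

# isna() flags both np.nan and None as missing
missing_mask = toy.isna()
missing_per_column = missing_mask.sum()

print(nan_equals_itself)
print(missing_per_column)
```

Because the equality check fails for NaN, pandas methods such as isna are the reliable way to locate missing entries.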

What is Missingno?

Missingno is a Python library for understanding the distribution of missing values through informative visualizations such as matrices, heatmaps and bar charts. With this library, it is possible to see where the missing values occur and to check how missingness in one column relates to missingness in another. Missing values are better handled once the dataset has been fully explored. Let us now implement this and see how it helps us pre-process the data better.

Implementation of Missingno

The first step in implementing this is to install the library using the pip command as follows:

pip install missingno

Once this is installed, let us select a dataset that contains missing values. I have selected the Life Expectancy dataset from Kaggle. This dataset is used to estimate average human life expectancy based on geographical location, health expenditure, disease and other factors. The dataset can be downloaded from Kaggle.

Loading the dataset

Let us now import some of the libraries and load our dataset. 

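# Mount Google Drive so the CSV file can be read from Colab
from google.colab import drive
drive.mount('/content/gdrive')

import numpy as np
import pandas as pd

# Load the Life Expectancy dataset into a DataFrame and preview it
life_expectancy = pd.read_csv("/content/gdrive/My Drive/Life Expectancy Data.csv")
life_expectancy.head()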

Now, let us count the missing values in each column using the isna method of pandas.

life_expectancy.isna().sum()
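
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch of that idea on a hypothetical stand-in DataFrame; the column names and values are illustrative, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (names and values are illustrative)
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [500.0, np.nan, np.nan, 900.0],
    "Alcohol": [1.2, 3.4, np.nan, 0.5],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
})

# Keep only columns with gaps, most incomplete first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

The same expressions should work on life_expectancy in place of df.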

The output confirms that some values are missing. It is now time to visualize them with Missingno.

Visualization of missing values

  1. Matrix

import missingno as msno

msno.matrix(life_expectancy)

The matrix covers all 2938 rows of the dataset. The white lines indicate the missing values in each column. The Hepatitis B, population and GDP columns appear to have the most missing values. On closer observation, a few patterns emerge in where values are missing. For example, when a row is missing its BMI value, it is also missing its value in the thinness 1-19 years column. Similarly, when the GDP value is missing, the income value is missing in the same rows. These patterns hint at how missingness in one feature is related to missingness in another. To quantify these relationships, we need a heatmap.
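
A pattern spotted by eye in the matrix can be checked numerically. The sketch below uses a hypothetical toy DataFrame and a helper named share_also_missing, which is introduced here for illustration and is not part of Missingno or pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "bmi" and "thinness" are missing in the same rows
df = pd.DataFrame({
    "bmi":      [21.0, np.nan, 19.5, np.nan, 24.1],
    "thinness": [1.1, np.nan, 2.3, np.nan, 0.9],
    "gdp":      [np.nan, 300.0, 410.0, np.nan, 520.0],
})

def share_also_missing(frame, a, b):
    """Of the rows where column a is missing, the fraction where b is missing too."""
    rows_missing_a = frame[a].isna()
    return frame.loc[rows_missing_a, b].isna().mean()

bmi_thinness = share_also_missing(df, "bmi", "thinness")
bmi_gdp = share_also_missing(df, "bmi", "gdp")
print(bmi_thinness, bmi_gdp)
```

A value close to 1 supports the visual impression that two columns are missing in the same rows.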

  2. Heatmap

msno.heatmap(life_expectancy)

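The heatmap shows nullity correlation, that is, how strongly the presence or absence of a value in one column is related to the presence or absence of a value in another. Positive correlation is shown in blue: the darker the shade, the stronger the correlation. The map shows that total expenditure and alcohol have the highest correlation, 0.9, so these two columns tend to be missing in the same rows. It also shows that the GDP and income columns are positively correlated, as we suspected from the matrix. This correlation describes where values are missing together; it does not by itself show that these columns affect the target.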
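
The numbers behind such a heatmap can be approximated with pandas alone. This is a sketch under the assumption that nullity correlation is the pairwise correlation of the is-missing indicators; the toy DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberate missingness patterns
df = pd.DataFrame({
    "alcohol":     [1.0, np.nan, 2.0, np.nan, 3.0, 4.0],
    "expenditure": [5.0, np.nan, 6.0, np.nan, 7.0, 8.0],
    "gdp":         [np.nan, 10.0, np.nan, 20.0, np.nan, np.nan],
    "year":        [2000, 2001, 2002, 2003, 2004, 2005],
})

# True where a value is missing
null_mask = df.isna()

# Columns with no gaps (or only gaps) have no variance, so leave them out
varying = null_mask.loc[:, null_mask.nunique() > 1]

# Correlate the 0/1 missingness indicators pairwise
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(2))
```

In this toy data, alcohol and expenditure are always missing together, while gdp is missing exactly where alcohol is present, so the two pairs sit at opposite ends of the scale.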

Another way to visualize the data for missing values is by using bar plots. 

  3. Bar Plot

msno.bar(life_expectancy)

The height of each bar is proportional to the number of non-missing values in that column, and that number is also shown at the top of the bar. Since the dataset has 2938 rows, any column with a count lower than this contains missing values.

Conclusion

In this article, we saw how to visualize missing data graphically and how to understand the relationships in missingness among different columns. Missingno helps in understanding the structure of a dataset with very few lines of code. With this information, it becomes easier to decide whether to impute the missing values or drop them using pandas, which can in turn improve the overall accuracy of the model.
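
As a rough illustration of the two options mentioned above, the sketch below drops incomplete rows and, alternatively, imputes with the column median. The toy DataFrame is hypothetical, and median imputation is only one possible strategy, chosen here for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
df = pd.DataFrame({
    "gdp": [500.0, np.nan, 700.0, 900.0],
    "alcohol": [1.0, 2.0, np.nan, 3.0],
})

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each gap with the median of its column
imputed = df.fillna(df.median(numeric_only=True))

print(len(dropped))
print(imputed)
```

Dropping is simpler but discards rows, while imputing keeps every row at the cost of introducing estimated values.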

Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
