Anyone working in data science understands the importance of data. Data is the resource that fuels a machine learning model. But raw, real-world data cannot be used until it has been pre-processed into a usable format. One of the most common problems with real-world data is missing values: some values in rows and columns simply do not exist. For good model training, however, we need the data to be as clean as possible.
Missing values are generally represented as NaN, which stands for Not a Number. Although the pandas library provides methods to impute these missing values, we first need to understand how, where and how many NaN values are distributed in the dataset. For this, there is a third-party Python library called Missingno.
The purpose of this article is to get a better understanding of missing data by visualizing it with Missingno.
What is Missingno?
Missingno is a Python library that helps us understand the distribution of missing values through informative visualizations. These include matrix plots, heatmaps and bar charts. With this library, it is possible to observe where the missing values occur and to check how the missingness of one column correlates with that of another. Missing values are better handled once the dataset has been fully explored. Let us now implement this and find out how it helps us pre-process the data better.
Implementation of Missingno
The first step in implementing this is to install the library using the pip command as follows:
pip install missingno
Once this is installed, let us select a dataset that contains missing values. I have selected the Life Expectancy dataset from Kaggle. This dataset is used to estimate average human life expectancy based on geographical location, health expenditure, disease and other factors. The dataset can be downloaded from Kaggle.
Loading the dataset
Let us now import some of the libraries and load our dataset.
from google.colab import drive
drive.mount('/content/gdrive')
import numpy as np
import pandas as pd
life_expentancy = pd.read_csv("/content/gdrive/My Drive/Life Expectancy Data.csv")
life_expentancy.head()
Now, let us count the missing values in each column using the isnull method of pandas.
The counts confirm that several columns contain missing values. It is now time to visualize them using the library.
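The counting step can be sketched as follows. This is a minimal example on a small made-up DataFrame (the column names mirror the dataset, but the values and the `df` name are assumptions for illustration); on the real data the same call would be applied to the DataFrame loaded above.

```python
import numpy as np
import pandas as pd

# Small made-up stand-in for the life expectancy data (values are illustrative only).
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "GDP": [584.3, np.nan, 612.7, np.nan, 631.7],
    "Population": [33736494.0, np.nan, 31731688.0, 3696958.0, np.nan],
    "BMI": [19.1, 18.6, np.nan, 17.6, 17.2],
})

# isnull() marks each missing cell as True; sum() then counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame.
total_missing = int(missing_per_column.sum())
print("Total missing:", total_missing)
```

Sorting the result with `sort_values(ascending=False)` is a convenient way to bring the most incomplete columns to the top.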
Visualization of missing values
import missingno as msno
The matrix covers all the rows of the dataset, from 1 to 2938. The white lines indicate the missing values in each column. The Hepatitis B, population and GDP columns seem to have the highest number of missing values. On closer observation, you can also notice a few patterns in the missing rows and columns. For example, the rows that are missing a value in the BMI column are also missing a value in the thinness 1-19 years column. Another pattern is that the rows with missing values in the GDP column are also missing values in the income column. These patterns give an idea of how the missingness of one feature is related to that of another. But to get a better idea of these correlations, we need to use heatmaps.
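The heatmap shows the nullity correlation between columns, that is, how strongly the presence or absence of a value in one column goes along with the presence or absence of a value in another. Positive correlation is shown in blue: the darker the shade, the stronger the correlation. The map shows that total expenditure and alcohol have the highest correlation, 0.9. It also shows that the GDP and income columns are positively correlated, in line with our initial intuition, which means that these two columns tend to be missing in the same rows.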
Another way to visualize the data for missing values is by using bar plots.
- Bar Plot
The height of each bar is proportional to the amount of non-missing data in that column, and the number of non-missing values is also shown for each column. Since the dataset has 2938 data points in total, any column with a count lower than this contains missing values.
In this article, we saw how to visualize missing data graphically and understand the relationships that exist among the different columns. Missingno helps in understanding the structure of the dataset with very few lines of code. With this information, it becomes easier and more efficient to use pandas either to impute the missing values or to drop them, which can in turn improve the overall accuracy of the model.