Tips for Automating EDA using Pandas Profiling, Sweetviz and Autoviz in Python

Himanshu Sharma

Exploratory Data Analysis (EDA) is the process of examining a dataset, largely through visualization and summary statistics, to understand its main characteristics. It should be performed before building a model or testing a hypothesis, so that patterns, anomalies and relationships in the data are known before any formal modelling begins.

Analyzing a dataset by hand is tedious and time-consuming. According to one study, EDA takes around 30% of the effort in a project, yet it cannot be skipped. Python has several open-source libraries that automate much of the EDA process and save a lot of time. The popular ones that we are going to explore are:

  • Pandas Profiling
  • Sweetviz
  • Autoviz

Using these modules, we will cover the following aspects of EDA in this article:



  • Creating detailed EDA reports
  • Creating reports that compare two datasets
  • Visualizing the dataset

1. Pandas Profiling

Pandas Profiling is a Python library that automates the EDA process and creates a detailed EDA report in just a few lines of code. For a small dataset such as the one used here, the report is generated in a few seconds; larger datasets take longer.

Here we will work on a dataset that contains car design data and can be downloaded from Kaggle. It has 205 rows and 26 columns, so analyzing it manually would take a lot of time. Let us see how we can analyze it using pandas-profiling.

Implementation

To use pandas profiling, we first need to install it with pip install pandas-profiling.
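
Before running the examples, it can help to confirm which of the three libraries are importable in the current environment. The sketch below uses only the standard library. The helper name installed_eda_tools is made up for illustration, and the import names are assumed to match the pip packages used in this article. Newer releases of pandas-profiling are published as ydata-profiling, so that import name is checked as well.

```python
from importlib.util import find_spec

# Import names assumed for the pip packages used in this article.
# pandas-profiling was later renamed ydata-profiling, so both are checked.
EDA_MODULES = ("pandas_profiling", "ydata_profiling", "sweetviz", "autoviz")


def installed_eda_tools(modules=EDA_MODULES):
    """Return {module_name: True/False} without importing the modules."""
    return {name: find_spec(name) is not None for name in modules}


for name, found in installed_eda_tools().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any module reported as missing can then be installed with the matching pip command.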

We will start by importing the libraries we will be using and loading the data we will be working on.

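import pandas as pd
from pandas_profiling import ProfileReport

# Load the car design data; the CSV is assumed to be in the working directory
df = pd.read_csv("car_design.csv")
df  # in a notebook, this displays the DataFrame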

After loading the dataset, we just need to run the following commands to generate and save the EDA report.

design_report = ProfileReport(df)
design_report.to_file(output_file='report.html')

After we run these commands, a detailed EDA report is created and saved as an HTML file named 'report.html', or whatever name you pass as the argument.
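
The two commands above can be wrapped in a small helper so that the report title and output path are easy to change. This is only a sketch: build_profile is a hypothetical name, and the default title is made up. It relies on the title and minimal options of ProfileReport; minimal=True asks the library to skip its more expensive computations, which can help when a full report is too slow. The import falls back from the newer ydata_profiling package name to pandas_profiling as used in this article.

```python
def build_profile(df, output_file="report.html", title="Car Design EDA", minimal=False):
    """Write a profiling report for `df` to `output_file` and return the path."""
    try:
        # Newer releases ship under the name ydata-profiling.
        from ydata_profiling import ProfileReport
    except ImportError:
        # Older releases, as used in this article.
        from pandas_profiling import ProfileReport

    # minimal=True turns off the costlier parts of the report.
    report = ProfileReport(df, title=title, minimal=minimal)
    report.to_file(output_file=output_file)
    return output_file
```

With the DataFrame loaded earlier, build_profile(df, minimal=True) would write a lighter report to report.html.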

Understanding the Report

The generated report contains a general overview followed by sections covering different characteristics of the dataset's attributes. The sections are:

A. Overview


B. Variable Properties

We can scroll down to see all the variables in the dataset and their properties.


C. Interaction of Variables

Similarly, we can also view the interaction of different attributes of the dataset with each other.

D. Correlations of the Variables

The report includes several types of correlation, such as Spearman's and Kendall's, computed across the attributes of the dataset.


E. Missing Values

The report also shows which attributes have missing values.

The report generated is really helpful in identifying patterns in the data and finding out the characteristics of the data. 
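
Some of the numbers in the report can be spot-checked with plain pandas. The sketch below counts missing values per column and computes a Spearman correlation matrix. The small DataFrame is made up for illustration and only stands in for the car data; its column names and values are assumptions, not taken from the real file.

```python
import pandas as pd

# Tiny made-up frame standing in for the car data; values are illustrative only.
sample = pd.DataFrame(
    {
        "horsepower": [111, 154, 102, None, 110],
        "highway-mpg": [27, 26, 30, 22, 25],
        "price": [13495, 16500, 13950, 17450, None],
    }
)

# Missing values per column, comparable to the "Missing Values" section.
missing_counts = sample.isna().sum()

# Rank-based correlation between the numeric columns,
# comparable to the Spearman matrix in the "Correlations" section.
spearman = sample.corr(method="spearman")

print(missing_counts)
print(spearman.round(2))
```

Replacing sample with the df loaded earlier gives the same checks on the real dataset; non-numeric columns would need to be dropped before calling corr.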

2. Sweetviz

Sweetviz is a Python library that focuses on exploring data through attractive, high-density visualizations. It not only automates EDA but can also compare datasets and help draw inferences from them.

Here we will analyze the same dataset as we used for pandas profiling.

Implementation

Before using sweetviz we need to install it by using pip install sweetviz.

We have already loaded the dataset above into the variable named "df", so we only need to import sweetviz and create the EDA report in a few lines of code.

import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_html('sweet_report.html')

This step generates the report and saves it in a file named "sweet_report.html"; the file name is user-defined.
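
Sweetviz can also relate every attribute to one target column. The sketch below wraps the call in a hypothetical helper, build_sweetviz_report, and uses the target_feat argument of analyze together with the open_browser argument of show_html. Using "highway-mpg" as the target in the usage note is an assumption based on the column mentioned later in this article.

```python
def build_sweetviz_report(df, output_file="sweet_report.html", target=None, open_browser=False):
    """Write a Sweetviz report for `df` and return the output path."""
    import sweetviz as sv  # imported lazily, only needed when a report is built

    # target_feat shows how each column relates to one chosen column;
    # Sweetviz expects that column to be numeric or boolean.
    report = sv.analyze(df, target_feat=target)

    # open_browser=False only writes the file, which suits scripts and servers.
    report.show_html(filepath=output_file, open_browser=open_browser)
    return output_file
```

For example, build_sweetviz_report(df, target="highway-mpg") would write the report without opening a browser tab.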

Understanding the Report

The report contains characteristics of the different attributes along with visualization.


In this report, we can clearly see the different attributes of the dataset and their characteristics, including missing values, distinct values, etc.

Sweetviz also allows you to compare two different datasets, or two parts of the same dataset, for example a training set and a testing set. The command below splits our dataset by row position into two roughly equal halves and compares them.

# Compare the rows from position 102 onward against the first 102 rows
compare_report = sv.compare(df[102:], df[:102])
compare_report.show_html('Compare.html')

In this report, we can easily compare the two halves of the data attribute by attribute. The reports are easy to understand and each is prepared in just 3 lines of code.
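
Splitting by row position can give a misleading comparison if the file happens to be sorted. A random split is one alternative, sketched below under a few assumptions: the helper names split_in_half and compare_halves are hypothetical, the seed is arbitrary, and the DataFrame is assumed to have a unique index (the default after read_csv). Passing [dataframe, "name"] pairs to compare labels each side of the report.

```python
def split_in_half(df, seed=42):
    """Randomly split `df` into two disjoint parts of nearly equal size.

    Assumes the index of `df` is unique, as it is after pd.read_csv.
    """
    first = df.sample(frac=0.5, random_state=seed)
    second = df.drop(first.index)
    return first, second


def compare_halves(df, output_file="Compare.html", seed=42):
    """Write a Sweetviz comparison of two random halves of `df`."""
    import sweetviz as sv  # imported lazily, only needed when a report is built

    train, test = split_in_half(df, seed=seed)
    # Each [dataframe, "name"] pair labels one side of the comparison.
    report = sv.compare([train, "Train"], [test, "Test"])
    report.show_html(filepath=output_file, open_browser=False)
    return output_file
```

Calling compare_halves(df) would then produce the same kind of comparison report as above, but on a shuffled split.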

3. Autoviz

Autoviz is an open-source Python library that focuses on visualizing the relationships in the data. It can find the most impactful features and plot them in just one line of code. Autoviz is fast and highly useful.

Before exploring Autoviz, we need to install it by using pip install autoviz.

Implementation

To use autoviz, we first need to import the AutoViz_Class and instantiate it.

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()


After instantiating the class, we just need to run one command, which creates visualizations of the dataset.

df = AV.AutoViz('car_design.csv')

Understanding the Report

The above command will create a report containing the following plots:

A. Pairwise scatter plot of all continuous variables

B. Histograms (KDE plots) of all continuous variables

C. Violin plots of all continuous variables

D. Heatmap of continuous variables

If we know which variable in the dataset depends on the others, we can pass it as an argument and visualize the data with respect to that dependent variable. For example, if we consider "highway-mpg" as the dependent variable, we use the command below.

df = AV.AutoViz('car_design.csv', depVar='highway-mpg')

This will create the same report as we have seen above, but in the context of the dependent variable, i.e. highway-mpg.
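
AutoViz can also work on a DataFrame that is already in memory instead of reading the CSV again. The sketch below passes the frame through the dfte argument and leaves filename empty. The wrapper name visualize_frame is hypothetical, and the remaining keyword values simply repeat what are assumed to be the library's documented defaults.

```python
def visualize_frame(df, target="", max_rows=150000, max_cols=30):
    """Run AutoViz on an in-memory DataFrame and return what AutoViz returns."""
    from autoviz.AutoViz_Class import AutoViz_Class  # lazy import

    av = AutoViz_Class()
    # filename stays empty; the DataFrame is handed over through dfte instead.
    return av.AutoViz(
        filename="",
        sep=",",
        depVar=target,           # empty string means no dependent variable
        dfte=df,
        header=0,
        verbose=0,
        lowess=False,
        chart_format="svg",
        max_rows_analyzed=max_rows,
        max_cols_analyzed=max_cols,
    )
```

For example, visualize_frame(df, target="highway-mpg") would plot the already loaded data with respect to highway-mpg.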

Conclusion 

In this article, we have learned how to automate the EDA process, which is generally time-consuming. We looked at three open-source Python libraries that can be used for this: Pandas-Profiling, Sweetviz and Autoviz. All of them are easy to use and create detailed reports on the characteristics of the data, with visualizations for correlations and comparisons.

Copyright Analytics India Magazine Pvt Ltd