Tips for Automating EDA using Pandas Profiling, Sweetviz and Autoviz in Python

This article explores EDA automation using Pandas Profiling, Sweetviz and Autoviz for three tasks: creating detailed EDA reports, creating reports that compare two datasets, and visualizing a dataset.

Exploratory Data Analysis (EDA) is used to explore different aspects of the data we are working on. It is a general approach to identifying the characteristics of a dataset, largely by visualizing it. EDA should be performed before building a model, making predictions or running formal hypothesis tests, so that we first see the patterns and visual insights the data contains.

Analyzing a dataset is tedious and time-consuming. According to one study, EDA accounts for around 30% of a project's effort, yet it cannot be skipped. Python offers open-source modules that can automate much of the EDA process and save a lot of time. The modules we are going to explore are:

  • Pandas Profiling
  • Sweetviz
  • Autoviz

Using these modules, we will cover the following EDA tasks in this article:

  • Creating Detailed EDA Reports
  • Creating reports for comparing 2 Datasets
  • Visualizing the dataset.

1. Pandas Profiling

Pandas Profiling is a Python library that automates the EDA process and creates a detailed EDA report in just a few lines of code. On small datasets such as the one used here, the report is typically ready quickly; larger datasets take longer because more statistics have to be computed.

Here we will work on a dataset that contains car design data and can be downloaded from Kaggle. It has 205 rows and 26 columns. Analyzing it manually would take a lot of time, so let us see how we can analyze it using pandas-profiling.

Implementation

To use Pandas Profiling, we first need to install it using pip install pandas-profiling. (Newer releases of the library are published under the name ydata-profiling.)

We start by importing the libraries we need and loading the data we will be working on.

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv("car_design.csv")
df

After loading the dataset we just need to run the following commands to generate and download the EDA report.

design_report = ProfileReport(df)
design_report.to_file(output_file='report.html')

Running these commands creates a detailed EDA report and saves it as an HTML file named 'report.html', or under whatever name you pass as the argument.
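The two commands above can be wrapped into a small helper. The following is only a sketch under a few assumptions: the function names and the report title are made up for illustration, and the import falls back between the newer package name (ydata_profiling) and the older one used in this article. The minimal flag asks the library to skip its more expensive computations, which can help when a dataset is large.

```python
import pandas as pd


def load_profile_class():
    """Return ProfileReport from whichever package name is installed."""
    try:
        # Newer releases of the library are published under this name.
        from ydata_profiling import ProfileReport
    except ImportError:
        # Older releases, as installed in this article.
        from pandas_profiling import ProfileReport
    return ProfileReport


def build_report(df: pd.DataFrame, path: str, minimal: bool = False) -> None:
    """Profile df and write the HTML report to path (hypothetical helper)."""
    ProfileReport = load_profile_class()
    # The title is illustrative; minimal=True requests a lighter report.
    report = ProfileReport(df, title="Car Design EDA", minimal=minimal)
    report.to_file(path)
```

With the dataframe loaded earlier, calling build_report(df, "report.html") should produce the same kind of report as before, and build_report(df, "report_min.html", minimal=True) a lighter one.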

Understanding the Report

The generated report contains a general overview followed by sections covering different characteristics of the dataset's attributes. The sections are:

A. Overview

B. Variable Properties

We can scroll down to see all the variables in the dataset and their properties.

C. Interaction of Variables

Similarly, we can also view the interaction of different attributes of the dataset with each other.

D. Correlations of the Variables

The report includes several types of correlation, such as Spearman's and Kendall's, computed between the attributes of the dataset.
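To see why a report shows more than one correlation type, here is a small sketch using plain pandas rather than the profiling library. The toy data is made up: y always increases with x, but not in a straight line. Spearman's correlation works on ranks, so it treats this relationship as perfect, while Pearson's correlation, which measures linear association, comes out lower.

```python
import pandas as pd

# Toy data (made up for illustration): y grows with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# DataFrame.corr returns a correlation matrix; pick the x-y entry.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```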

E. Missing Values

The report also shows which attributes have missing values.

Overall, the report helps us identify patterns in the data and understand its characteristics.
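For comparison, the same kind of missing-value summary can be computed by hand with pandas. This is a rough sketch on a tiny made-up table, not the car dataset itself; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up table with a few gaps (not the real car dataset).
cars = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "fuel-type": ["gas", "gas", None, "diesel"],
    "price": [13495.0, 16500.0, np.nan, np.nan],
})

# Count and percentage of missing entries per column.
missing = pd.DataFrame({
    "missing_count": cars.isna().sum(),
    "missing_pct": (cars.isna().mean() * 100).round(1),
})
print(missing.sort_values("missing_count", ascending=False))
```

If a CSV file marks missing entries with a placeholder such as ? (an assumption, so check your own file), passing na_values="?" to pd.read_csv converts them to NaN so that they are counted here and in the report.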

2. Sweetviz

Sweetviz is a Python library that focuses on exploring data through attractive, high-density visualizations. Besides automating EDA, it can also compare datasets and help draw inferences from the comparison.

Here we will analyze the same dataset as we used for pandas profiling.

Implementation

Before using sweetviz we need to install it by using pip install sweetviz.

We have already loaded the dataset above into the variable named "df", so we only need to import sweetviz and create the EDA report in a few lines of code.

import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_html('sweet_report.html')

This step generates the report and saves it to a file named "sweet_report.html"; the file name is chosen by the user.

Understanding the Report

The report contains characteristics of the different attributes along with visualization.

In this report, we can clearly see the different attributes of the dataset and their characteristics, including missing values, distinct values, etc.

Sweetviz can also compare two datasets, or two subsets of the same dataset such as training and testing splits. The command below splits our dataset into two roughly equal halves by row position and compares them.

df1 = sv.compare(df[102:], df[:102])
df1.show_html('Compare.html')

This report places the two subsets side by side, so differences in each attribute are easy to spot. The reports are easy to read, and each one takes only a few lines of code to produce.
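Slicing by row position works, but if the file happens to be sorted, the two halves may differ for that reason alone. A possible variation, sketched below, is to split the rows at random and give each half a label that appears in the report. The helper names, the seed and the labels "Training" and "Test" are assumptions made for this example.

```python
import pandas as pd


def split_half(df: pd.DataFrame, seed: int = 0):
    """Randomly split df into two non-overlapping halves.

    Assumes the index is unique (true for a freshly loaded CSV).
    """
    first = df.sample(frac=0.5, random_state=seed)
    second = df.drop(first.index)
    return first, second


def compare_halves(df: pd.DataFrame, path: str = "Compare.html") -> None:
    """Build a Sweetviz comparison report with labelled halves."""
    import sweetviz as sv  # imported here so split_half works without it

    train, test = split_half(df)
    # Each dataset is passed as [dataframe, label].
    report = sv.compare([train, "Training"], [test, "Test"])
    # open_browser=False only writes the file instead of opening it.
    report.show_html(path, open_browser=False)
```

Calling compare_halves(df) with the dataframe loaded earlier should write the comparison to Compare.html.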

3. Autoviz

Autoviz is an open-source Python library that focuses on visualizing relationships in the data. It can pick out the most impactful features and plot them in just one line of code.

Before exploring Autoviz, we need to install it using pip install autoviz.

Implementation

To use Autoviz, we first need to import the AutoViz_Class and instantiate it.

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()


After instantiating the class, we just need to run one command to create visualizations of the dataset.

df = AV.AutoViz('car_design.csv')

Understanding the Report

The above command creates a report that contains the following plots:

A. Pairwise scatter plot of all continuous variables



B. Histograms (KDE Plots) of all continuous variables

C. Violin Plots of all continuous variables

D. Heatmap of continuous variables



If we know which variable in the dataset depends on the others, we can pass it as an argument and visualize the data with respect to it. For example, if we treat "highway-mpg" as the dependent variable, we use the command below.

df = AV.AutoViz('car_design.csv', depVar='highway-mpg')

This will create the same report as we have seen above but in the context of the dependent variable i.e. highway-mpg.
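When the data is already loaded as a dataframe, it can be handed to Autoviz directly instead of re-reading the CSV file. The sketch below also checks that the chosen dependent variable really is a column, since a misspelled name is an easy mistake. The helper name is made up for illustration, and the column name "highway-mpg" is taken from the example above.

```python
import pandas as pd


def visualize_with_target(df: pd.DataFrame, dep_var: str):
    """Run AutoViz on an in-memory dataframe after validating the target."""
    if dep_var not in df.columns:
        raise KeyError(f"{dep_var!r} not found; available: {list(df.columns)}")

    # Imported here so the column check works even without autoviz installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    # filename is left empty because the data is passed through dfte.
    return av.AutoViz(filename="", dfte=df, depVar=dep_var, verbose=0)
```

For the car data this would be called as visualize_with_target(cars_df, "highway-mpg"), where cars_df stands for the dataframe read from car_design.csv.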

Conclusion 

In this article, we have learned how to automate EDA, which is generally a time-consuming process. We looked at three open-source Python libraries for this purpose: Pandas Profiling, Sweetviz and Autoviz. All three are easy to use and produce detailed reports on the characteristics of the data, along with visualizations for correlations and comparisons.

Himanshu Sharma
An aspiring Data Scientist currently Pursuing MBA in Applied Data Science, with an Interest in the financial markets. I have experience in Data Analytics, Data Visualization, Machine Learning, Creating Dashboards and Writing articles related to Data Science.