Exploratory Data Analysis (EDA) is the process of examining a dataset, largely through summary statistics and visualization, to understand its characteristics before building a model or testing a hypothesis. It surfaces the patterns and relationships in the data that any formal modelling should take into account.
Analyzing a dataset by hand is tedious and time-consuming. According to one study, EDA takes around 30% of a project's effort, yet it cannot be skipped. Python provides open-source modules that automate much of the EDA process and save a lot of time. The modules we are going to explore are:
- Pandas Profiling
- Sweetviz
- Autoviz
Using these modules, we will cover the following aspects of EDA in this article:
- Creating detailed EDA reports
- Creating reports for comparing two datasets
- Visualizing the dataset
1. Pandas Profiling
Pandas Profiling is a Python library that automates the EDA process and creates a detailed EDA report in just a few lines of code. On small and medium-sized datasets such as the one used here, it produces a report within seconds; very large datasets take noticeably longer.
Here we will work on a dataset that contains car design data and can be downloaded from Kaggle. It has 205 rows and 26 columns, so analyzing it manually would take a lot of time. Let us see how we can analyze it using pandas-profiling.
To use Pandas Profiling, we first need to install it with pip install pandas-profiling. (Recent releases of the library are published under the name ydata-profiling.)
We start by importing the libraries we will be using and loading the data we will be working on.
import pandas as pd
from pandas_profiling import ProfileReport

# Load the car design dataset downloaded from Kaggle
df = pd.read_csv("car_design.csv")
df  # in a notebook, this displays the DataFrame
After loading the dataset, we just need to run the following commands to generate and save the EDA report.
design_report = ProfileReport(df)
design_report.to_file(output_file='report.html')
These commands create a detailed EDA report and save it as an HTML file named 'report.html', or whatever name you pass as the argument.
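For larger datasets, a full report can take a while to build. The sketch below is one way to wrap the two commands above in a helper with an optional minimal mode. The helper names load_profile_report and write_profile and the report title are hypothetical; title and minimal are ProfileReport arguments, and the fallback to the ydata_profiling module name assumes a newer release of the library is installed.

```python
import importlib


def load_profile_report():
    """Return the ProfileReport class, or None if neither package is installed."""
    # Assumption: newer releases ship as `ydata_profiling`,
    # older ones as `pandas_profiling`.
    for module_name in ("ydata_profiling", "pandas_profiling"):
        try:
            return importlib.import_module(module_name).ProfileReport
        except ImportError:
            continue
    return None


def write_profile(df, output_file="report.html", minimal=False):
    """Profile `df`, save the report as HTML, and return the output path."""
    profile_report_cls = load_profile_report()
    if profile_report_cls is None:
        raise ImportError("Install pandas-profiling (or ydata-profiling) first.")
    # minimal=True switches off the more expensive parts of the report,
    # which helps on large datasets.
    report = profile_report_cls(df, title="Car Design EDA", minimal=minimal)
    report.to_file(output_file)
    return output_file
```

With the DataFrame loaded earlier, write_profile(df) would reproduce the report above, and write_profile(df, "report_minimal.html", minimal=True) would build a lighter version.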
Understanding the Report
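The generated report contains a general overview followed by separate sections for the different characteristics of the dataset's attributes. The sections are:
A. Overview
The overview summarizes the dataset as a whole, including the number of variables, the number of observations, and the count of missing cells.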
B. Variable Properties
We can scroll down to see all the variables in the dataset and their properties.
C. Interaction of Variables
Similarly, we can also view the interaction of different attributes of the dataset with each other.
D. Correlations of the Variables
The report includes several types of correlation, such as Spearman’s and Kendall’s, computed across the attributes of the dataset.
E. Missing Values
The report also shows which attributes have missing values.
The report generated is really helpful in identifying patterns in the data and finding out the characteristics of the data.
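To make the sections above more concrete, the following sketch computes a few of the same summaries by hand with plain pandas. It is not how Pandas Profiling is implemented; the tiny DataFrame is made-up stand-in data whose column names merely resemble a car dataset, and only Pearson and Spearman correlations are shown.

```python
import pandas as pd

# Made-up stand-in data (not the Kaggle dataset).
sample = pd.DataFrame({
    "horsepower": [111, 154, 102, 115, None],
    "highway-mpg": [27, 26, 30, 22, 25],
    "fuel-type": ["gas", "gas", "diesel", "gas", None],
})

# Overview: size of the dataset and total number of missing cells.
overview = {
    "rows": len(sample),
    "columns": sample.shape[1],
    "missing_cells": int(sample.isna().sum().sum()),
}

# Missing values: count of missing entries per attribute.
missing_per_column = sample.isna().sum()

# Correlations: restricted to numeric columns; rows with a missing
# value are ignored pairwise.
numeric = sample.select_dtypes("number")
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

print(overview)
print(missing_per_column)
print(spearman)
```

Writing these checks out once shows how much the automated report bundles into a single call.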
2. Sweetviz
Sweetviz is a Python library that focuses on exploring data through attractive, high-density visualizations. It not only automates EDA but can also compare datasets and help draw inferences from them.
Here we will analyze the same dataset as we used for pandas profiling.
Before using sweetviz we need to install it by using pip install sweetviz.
We have already loaded the dataset into the variable named “df”, so we only need to import sweetviz and create the EDA report in a few lines of code.
import sweetviz as sv

sweet_report = sv.analyze(df)
sweet_report.show_html('sweet_report.html')
This step generates the report, saves it to a file named “sweet_report.html” (the file name is user-defined), and opens it in the default browser.
Understanding the Report
The report contains characteristics of the different attributes along with visualization.
In this report, we can clearly see the attributes of the dataset and their characteristics, including missing values, distinct values, and so on.
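Sweetviz can also relate every attribute to a chosen target column. The sketch below shows one possible way to do that while saving the report without opening a browser. The helper name write_sweetviz_report is hypothetical; target_feat and open_browser are arguments of sv.analyze and show_html, and treating “highway-mpg” as the target is only an assumption for illustration (Sweetviz expects the target to be a numeric or boolean column).

```python
def write_sweetviz_report(df, output_file="sweet_report.html", target=None):
    """Analyze `df` with Sweetviz and save the HTML report to `output_file`."""
    # Imported lazily so this module still loads when sweetviz is not installed.
    import sweetviz as sv

    # target_feat asks Sweetviz to relate each attribute to the given column.
    report = sv.analyze(df, target_feat=target)
    # open_browser=False only writes the file, which suits scripts and servers.
    report.show_html(output_file, open_browser=False)
    return output_file
```

With the DataFrame from earlier, a call such as write_sweetviz_report(df, target="highway-mpg") would produce the same kind of report with the target's relationship to each attribute added.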
Sweetviz can also compare two different datasets, or two subsets of the same dataset such as a training and a testing split. The command below splits our dataset by position into two nearly equal parts (rows 102 onward and the first 102 rows) and compares them.
# Compare rows 102 onward against the first 102 rows
df1 = sv.compare(df[102:], df[:102])
df1.show_html('Compare.html')
This report lets us compare the two subsets side by side. The reports are easy to understand, and each one is prepared in just three lines of code.
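Slicing by position only gives a fair comparison if the rows are in no particular order. As an alternative, the sketch below shuffles the rows before splitting and labels the two halves in the report. This is a swapped-in technique, not the article's original command: split_half and write_comparison are hypothetical helper names, the stand-in DataFrame is made up, and passing [DataFrame, "name"] pairs to sv.compare is the labelling form Sweetviz accepts.

```python
import pandas as pd


def split_half(df, seed=0):
    """Shuffle-split a DataFrame into two disjoint, nearly equal halves."""
    first = df.sample(frac=0.5, random_state=seed)
    second = df.drop(first.index)
    return first, second


def write_comparison(train, test, output_file="Compare.html"):
    """Compare two DataFrames with Sweetviz and save the HTML report."""
    # Imported lazily so the split helper works without sweetviz installed.
    import sweetviz as sv

    report = sv.compare([train, "Train"], [test, "Test"])
    report.show_html(output_file, open_browser=False)
    return output_file


# Made-up stand-in data with the same number of rows as the car dataset.
cars = pd.DataFrame({
    "horsepower": range(100, 305),
    "highway-mpg": [20 + i % 15 for i in range(205)],
})
train, test = split_half(cars)
print(len(train), len(test))
```

Calling write_comparison(train, test) afterwards would write the comparison report. The fixed seed keeps the split reproducible between runs.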
3. Autoviz
Autoviz is an open-source Python library that focuses on visualizing the relationships in the data. It can find the most impactful features and plot them in just one line of code.
Before exploring Autoviz, we need to install it with pip install autoviz.
To use Autoviz, we first import the AutoViz_Class and instantiate it.
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
After instantiating the Autoviz class, we just need to run one command to create the visualizations of the dataset.
df = AV.AutoViz('car_design.csv')
Understanding the Report
The above command creates a report that contains the following plots:
A. Pairwise scatter plot of all continuous variables
B. Histograms (KDE plots) of all continuous variables
C. Violin Plots of all continuous variables
D. Heatmap of continuous variables
If we know which variable in the dataset depends on the others, we can pass it as an argument and visualize the data with respect to that dependent variable. For example, if we treat “highway-mpg” as the dependent variable, we use the command below.
df = AV.AutoViz('car_design.csv', depVar='highway-mpg')
This will create the same report as we have seen above but in the context of the dependent variable i.e. highway-mpg.
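When the data is already loaded in a DataFrame, Autoviz can take it directly instead of reading the CSV again. The following sketch assumes that usage and adds a small check that the dependent variable actually exists. The helper name visualize is hypothetical; filename, dfte, depVar, and verbose are AutoViz parameters.

```python
def visualize(df, target=None):
    """Run AutoViz on an in-memory DataFrame, optionally with a dependent variable."""
    # Fail early with a clear message if the dependent variable is misspelled.
    if target is not None and target not in df.columns:
        raise KeyError(f"Dependent variable {target!r} is not a column of the data.")

    # Imported lazily so the check above works without autoviz installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    # An empty filename together with dfte tells AutoViz to use the DataFrame.
    return av.AutoViz(filename="", dfte=df, depVar=target or "", verbose=0)
```

For example, visualize(df, target="highway-mpg") would produce the same plots as the command above without touching the file on disk.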
In this article, we have learned how to automate the EDA process, which is generally time-consuming. We looked at three open-source Python libraries that can be used for this: Pandas Profiling, Sweetviz, and Autoviz. All three are easy to use and create detailed reports on the characteristics of the data, with visualizations for correlations and comparisons.