To perform exploratory data analysis (EDA) on a dataset, one usually needs to be well versed in Python visualization libraries such as seaborn, matplotlib, and plotly in order to build graphs that reveal insights in the data. Finding these insights is a preliminary step in any data science or machine learning project, because the step that follows, feature selection, depends on the results of EDA. EDA therefore plays a crucial role in the accuracy of such projects.
In this blog, we look at easier ways of performing EDA on any dataset by using some automated libraries.
1. D-Tale
The first step is to install the library by running the command
!pip install dtale
in the Anaconda prompt or in the console itself (the leading ! is only needed inside a notebook cell). Once the dtale library is installed, import the commonly used libraries and load the titanic dataset from seaborn using .load_dataset.
Once the dataset is loaded, pass it to D-Tale with dtale.show(df); this displays the data frame in a D-Tale window. Clicking the dropdown button exposes the available operations, ranging from computing Pearson’s correlation between columns to plotting 2D and 3D graphs and finding outliers in the data, so most common EDA tasks can be performed from the interface.
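The loading and display steps described above can be sketched as follows. This is a minimal sketch rather than the exact code from the original notebook; the function name open_titanic_in_dtale is a hypothetical helper introduced only for illustration.

```python
def open_titanic_in_dtale():
    """Load seaborn's titanic dataset and open it in a D-Tale window."""
    # Imported inside the function so the sketch can be pasted in
    # before the packages are installed.
    import seaborn as sns
    import dtale

    # load_dataset fetches the sample data on first use,
    # so it needs an internet connection.
    df = sns.load_dataset("titanic")

    # dtale.show starts D-Tale for this data frame and returns a handle to it.
    return dtale.show(df)
```

In a Jupyter notebook, calling open_titanic_in_dtale() as the last expression of a cell should render the D-Tale grid in the cell output.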
One useful feature of this library is that the source code behind an operation can be copied using the code export option. For example, the code behind the Pearson’s correlation can be obtained by clicking the <>Code Export button.
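The exported code itself is not reproduced here. As a rough idea of what such a computation looks like in plain pandas, the sketch below computes the same kind of correlation matrix; it is an assumption about an equivalent, not D-Tale's actual export, and pearson_matrix is a hypothetical helper name.

```python
def pearson_matrix(df):
    """Return pairwise Pearson correlations between the numeric columns of df."""
    # Keep only numeric columns; Pearson correlation is not defined
    # for string or categorical columns.
    numeric = df.select_dtypes(include="number")

    # "pearson" is the pandas default method, spelled out here for clarity.
    return numeric.corr(method="pearson")
```

Passing the titanic data frame to pearson_matrix returns a square data frame with one row and one column per numeric feature.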
2. Pandas Profiling
Pandas profiling is another automated library that can perform EDA, but it is more limited: it offers fewer operations than dtale and its performance is lower. To install this library run the command
!pip install pandas-profiling
either in the console or in the Anaconda prompt. Once the installation is complete, a dataset can be loaded and profiled.
The tips dataset is loaded from seaborn and the columns of the dataset are displayed.
Here, we import ProfileReport from the installed pandas profiling library and save the output as an HTML file. The HTML file is saved in the current working directory. Once the report is opened in a browser, various EDA operations can be explored by navigating through the available options.
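The steps described above can be sketched as follows. The output file name tips_report.html, the report title, and the helper name profile_tips are assumptions made for illustration, since the original file name is not given. Note also that recent releases of the package are published as ydata-profiling, in which case the import would be from ydata_profiling instead.

```python
def profile_tips(output_path="tips_report.html"):
    """Profile seaborn's tips dataset and save the report as an HTML file."""
    # Imported inside the function so the sketch can be pasted in
    # before the packages are installed.
    import seaborn as sns
    from pandas_profiling import ProfileReport

    # Load the tips dataset and display its column names.
    df = sns.load_dataset("tips")
    print(df.columns.tolist())

    # Build the profile and write it to disk; a relative path is
    # resolved against the current working directory.
    profile = ProfileReport(df, title="Tips dataset profile")
    profile.to_file(output_path)
    return output_path
```

Calling profile_tips() writes the report and returns its path, which can then be opened in any browser.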
3. Sweet Viz
Sweetviz is also a handy automated EDA library. Here we again load the titanic dataset from seaborn. Before that, the sweetviz library needs to be installed, which can be done by running the
!pip install sweetviz
command in the console or in the anaconda prompt.
Running the two lines of code opens an HTML page named output_report.
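The two lines mentioned above are not reproduced in the text. A minimal sketch of what they might look like, wrapped in a hypothetical helper named sweetviz_titanic_report, is:

```python
def sweetviz_titanic_report(output_path="output_report.html"):
    """Analyze seaborn's titanic dataset with Sweetviz and write an HTML report."""
    # Imported inside the function so the sketch can be pasted in
    # before the packages are installed.
    import seaborn as sns
    import sweetviz as sv

    df = sns.load_dataset("titanic")

    # The two core lines: build the report, then render it to HTML.
    report = sv.analyze(df)
    report.show_html(output_path)
    return output_path
```

By default show_html also opens the generated page in the browser, which matches the behaviour described above.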
The output_report.html report, like those of the other automated EDA libraries, can be used to perform high-level EDA.
4. AutoViz
As with the other libraries, we first need to install it
!pip install autoviz
and then run AutoViz on the data; here we use the titanic dataset to perform EDA.
AutoViz_Class is imported from autoviz and instantiated as the object AV.
Since I am loading the dataset from a local file, I have assigned its path to filename; it could also be loaded from seaborn. Running this code produces a series of basic EDA charts.
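The AutoViz steps described above can be sketched as follows. The file name titanic.csv is a placeholder, because the original local path is not given, and run_autoviz is a hypothetical helper name. The optional df argument illustrates the alternative of passing a data frame (for example, one loaded from seaborn) instead of a file.

```python
def run_autoviz(filename="titanic.csv", df=None):
    """Generate AutoViz charts from a local CSV file or from a data frame."""
    # Imported inside the function so the sketch can be pasted in
    # before the package is installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    # Initialize AutoViz as the object AV.
    AV = AutoViz_Class()

    if df is not None:
        # When a data frame is supplied, the filename is left empty
        # and the frame is passed through the dfte argument.
        return AV.AutoViz("", dfte=df)

    # Otherwise AutoViz reads the data from the local file.
    return AV.AutoViz(filename)
```

For example, run_autoviz("titanic.csv") reads the local file, while run_autoviz(df=sns.load_dataset("titanic")) would use the seaborn copy instead.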
You can go through my Jupyter notebook here and try out the different automated EDA libraries. Share the conclusions you draw from them, and if I missed any useful insights in my own approach, share that in the comments too.