
Hands-on Tutorial On QuickDA For Data Analysis and Cleaning


Analysing a dataset helps us understand the data and its attributes, address anomalies, determine whether the data is clean, and discover whether it follows any particular pattern.

Cleaning the data is a necessary step in which we bring the data up to certain standards, for example by removing or replacing null values and outliers. It gives us a clear view of the data without anomalies or junk values.

There are several techniques in Python that we can use to analyse and clean a dataset before drawing insights from it. QuickDA is an open-source Python library that helps analyse and clean a dataset easily and efficiently with a few lines of code.

In this article we will explore:

  1. Data Analysis Using QuickDA
  2. Data Cleaning using QuickDA
  3. Data Visualisation using QuickDA

Implementation:

Like any other Python library, we first need to install QuickDA before we can explore it, using pip install quickda
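For reference, the install is a one-liner (assuming pip is available in your environment):

```shell
# Install QuickDA from PyPI (run once per environment):
pip install quickda
```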

  1. Importing required libraries

QuickDA contains different functions for different purposes. We will import all of them and use each according to our needs.

import pandas as pd

import numpy as np

from quickda.explore_data import *

from quickda.clean_data import *

from quickda.explore_numeric import *

from quickda.explore_categoric import *

from quickda.explore_numeric_categoric import *

from quickda.explore_time_series import *

  2. Loading the Dataset

Here we will be using a car design dataset that contains different attributes of cars from different automobile makers. We will load this dataset and perform operations on it.

df = pd.read_csv('car_design.csv')

df

Here we can see that the data contains a junk value, i.e. ‘?’, which we will remove in the further steps.

  3. Analysing the Dataset

We will start analysing the dataset by displaying its statistical properties using the explore function, which is similar to pandas’ describe function in that it displays all the major statistical properties of the dataset.

explore(df)

QuickDA has built-in pandas-profiling, which allows us to create an EDA report of the data. This report contains an analysis of each and every attribute of the dataset. We will create this profiling report and name it Design Report.

explore(df, method="profile", report_name="Design Report")

As you can see, the report contains different sections for the different properties of all the attributes of the dataset. With this, we will end our data analysis.

Further, we will move to data cleaning. 

  4. Data Cleaning

In this step, we will start cleaning the dataset, but before that, we will standardise the column names using the clean function, which renames all columns that are not in a standard form. By standard, we mean the column name should not contain spaces, so spaces are replaced with ‘_’, and so on.

df = clean(df)
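To get a feel for what this standardisation does, here is a plain-pandas sketch on a hypothetical frame — an approximation of the renaming behaviour, not QuickDA’s exact implementation:

```python
import pandas as pd

# Hypothetical frame with non-standard column names (not the real dataset):
demo = pd.DataFrame({"Normalized Losses": [164], "Body Style": ["sedan"]})

# Roughly what clean(df) does to column names:
demo.columns = (
    demo.columns.str.strip()            # drop surrounding whitespace
                .str.lower()            # lowercase everything
                .str.replace(" ", "_")  # spaces become underscores
)
print(demo.columns.tolist())  # ['normalized_losses', 'body_style']
```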

In the data cleaning part, we start with dropping the columns that are not relevant or will not be used for our purpose. We will remove the ‘aspiration’ column for example.

df = clean(df, method='dropcols', columns=['aspiration'])

This will drop the desired column from the data frame. Next, we will check for duplicate rows in the dataset and remove any duplicate rows.

df = clean(df, method="duplicates")
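The deduplication step behaves much like pandas’ own drop_duplicates — a rough equivalent on a toy frame (made-up values), not QuickDA’s internal code:

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values):
demo = pd.DataFrame({"make": ["audi", "audi", "bmw"],
                     "price": [13950, 13950, 16430]})

# Plain-pandas counterpart of clean(df, method="duplicates"):
deduped = demo.drop_duplicates().reset_index(drop=True)
print(len(demo), "->", len(deduped))  # 3 -> 2
```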

In the next step, we will replace the junk value we saw when we loaded the dataset with the value we want; here, we will replace ‘?’ with NaN.

df = clean(df, method="replaceval", columns=['normalized-losses'], to_replace="?", value=np.nan)
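The same replacement can be done in plain pandas — a minimal equivalent of method="replaceval", shown on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical column containing the '?' placeholder:
demo = pd.DataFrame({"normalized_losses": ["164", "?", "104"]})

# Replace the placeholder with NaN, as QuickDA's replaceval does:
demo["normalized_losses"] = demo["normalized_losses"].replace("?", np.nan)
print(demo["normalized_losses"].isna().sum())  # 1
```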

Here we can see that ‘?’ in the normalized-losses column is replaced by NaN, as required. The next step is dropping the missing values in the dataset.

df = clean(df, method="dropmissing")

This will remove the NaN values from the dataset. The next step is to check the datatypes of all the attributes and convert them if the datatype is not correct.

df.dtypes

Here we see that although the price values are numeric, their data type is object, so we need to convert the column to a numeric datatype.

df = clean(df, method='dtypes', columns='price', dtype='numeric')

This will convert the datatype of the price column to numeric. Similarly, we can change different columns to different data types. 
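In plain pandas, pd.to_numeric is the counterpart of this conversion — a sketch on illustrative values, not the real dataset:

```python
import pandas as pd

# Prices arrive as strings, so the column dtype is object:
demo = pd.DataFrame({"price": ["13950", "16500", "13495"]})

# Plain-pandas equivalent of clean(df, method='dtypes', dtype='numeric'):
demo["price"] = pd.to_numeric(demo["price"])
print(demo["price"].dtype)  # int64
```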

  5. Data Visualisation

The third part is visualising the data through univariate and bivariate analysis. In visualisation, we try to find out different aspects of the dataset, such as outliers, relationships, and distributions.

We will start with outlier analysis and distribution analysis, which are best performed with box plots and histograms respectively. Let us run both using the ‘eda_num’ function, which creates box plots and histograms of all numeric variables.

eda_num(df)

Outlier Analysis (Box Plot)

Distribution Analysis (Histograms)

Next, we will remove the outliers in the dataset using the clean function.

df = clean(df, method='outliers')
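One common outlier rule is the 1.5×IQR fence sketched below — a generic illustration on made-up values, not necessarily the exact rule clean applies internally:

```python
import pandas as pd

# Toy series with one obvious outlier (95):
values = pd.Series([21, 22, 23, 24, 25, 95])

# Keep only points within 1.5 * IQR of the quartiles:
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
kept = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
print(kept.tolist())  # [21, 22, 23, 24, 25]
```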

After removing the outliers let us visualise the correlation between different numerical attributes by plotting them on a heatmap.

eda_num(df, method="correlation")
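The heatmap is built from an ordinary correlation matrix, which we could also compute directly in pandas — shown here on illustrative columns, not the real dataset:

```python
import pandas as pd

# Two hypothetical, strongly related numeric columns:
demo = pd.DataFrame({"city_mpg": [21, 24, 19, 30],
                     "highway_mpg": [27, 30, 25, 38]})

# Pairwise Pearson correlations, the data behind the heatmap:
corr = demo.corr()
print(corr.loc["city_mpg", "highway_mpg"])  # close to 1 for these values
```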


Plotting categorical variables is also an important part of data visualisation, so next we will plot one of the categorical variables on a bar chart.

eda_cat(df, x='num-of-doors')


Similarly, we can plot a bar chart for more than one variable in a single plot as shown in the image below.

eda_cat(df, x='body-style', y='num-of-doors')


Other than this, we can also plot different charts and plots. Some of them are given below.

  1. Scatter Plot

eda_numcat(df, x='city-mpg', y='highway-mpg', hue='num-of-doors', method='relationship')

  2. Violin Plots

eda_numcat(df, x='num-of-doors', y='city-mpg', method='comparison')


Conclusion:

In this article, we started with data analysis and saw how QuickDA’s simple one-line commands save time and effort. We then moved to data cleaning, an integral part of data manipulation and EDA, which QuickDA makes easy. The final step was visualising the dataset and the relationships between its attributes, which generally takes time and effort but here needed just one line of code each. QuickDA is therefore a handy tool for analysing, manipulating, and visualising any dataset easily and effortlessly.


Himanshu Sharma

An aspiring Data Scientist currently Pursuing MBA in Applied Data Science, with an Interest in the financial markets. I have experience in Data Analytics, Data Visualization, Machine Learning, Creating Dashboards and Writing articles related to Data Science.