Analysing a dataset helps us identify its attributes, spot anomalies, check whether the data is clean, and see whether it follows any particular pattern.
Cleaning is a necessary step in which we bring the data up to a chosen standard, for example by removing or replacing null values and outliers. It gives us a clear view of the data, free of anomalies and junk values.
Python offers several techniques for analysing and cleaning a dataset before drawing insights from it. QuickDA is an open-source Python library that helps analyse and clean a dataset efficiently with a few lines of code.
In this article we will explore:
- Data Analysis Using QuickDA
- Data Cleaning using QuickDA
- Data Visualisation using QuickDA
Like any other Python library, QuickDA must be installed before we can explore it. This can be done using pip install quickda
- Importing required libraries
QuickDA contains different functions for different purposes. We will import all of them and use each one as needed.
import numpy as np
import pandas as pd
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_categoric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *
- Loading the Dataset
Here we will be using a car design dataset that contains different attributes of cars from different automobile makers. We will load this dataset and perform operations on it.
df = pd.read_csv('car_design.csv')
The data contains a junk value, ‘?’, which we will deal with in the later steps.
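Before deciding how to handle a placeholder such as ‘?’, it can help to know where it occurs. The following is a minimal sketch in plain pandas; the small frame is a hypothetical stand-in for car_design.csv, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for car_design.csv; the real file has more columns.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota"],
    "normalized-losses": ["164", "?", "158", "?"],
    "price": ["13950", "16430", "?", "5348"],
})

# Count how many cells in each column hold the '?' placeholder.
junk_counts = (sample == "?").sum()
print(junk_counts)
```

Columns with a non-zero count are the ones that need the replacement step described later in the cleaning section.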
- Analysing the Dataset
We start by displaying the statistical properties of the dataset using the explore function. It is similar to the pandas describe function and summarises the major statistical properties of the dataset.
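For comparison, a similar overview can be assembled in plain pandas. This is only a sketch of the idea, not QuickDA's implementation, and the exact fields that explore reports may differ; the sample frame and the summary column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative only.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "audi", None],
    "city-mpg": [24, 23, 24, 31],
})

# One row per column: dtype, non-null count, null count, distinct values.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "count": sample.count(),
    "null_sum": sample.isna().sum(),
    "unique": sample.nunique(),
})
print(summary)

# describe() adds the usual numeric statistics (mean, std, quartiles).
print(sample.describe())
```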
QuickDA has pandas-profiling built in, which allows us to create an EDA report of the data. This report contains an analysis of every attribute of the dataset. We will create this profile report and name it Design Report.
explore(df, method="profile", report_name="Design Report")
The report contains different sections for the different properties of every attribute in the dataset. With this, we end our data analysis and move on to data cleaning.
- Data Cleaning
Before cleaning the dataset, we standardise the column names using the clean function, which renames any columns that are not in standard form. By standard we mean, for example, that a column name should not contain spaces, so spaces are replaced with ‘_’.
df = clean(df)
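To make the idea of a standard column name concrete, here is a minimal pandas sketch. Only the space-to-underscore rule comes from the description above; trimming and lower-casing are assumptions added for illustration, and the helper name standardise_columns is hypothetical rather than part of QuickDA.

```python
import pandas as pd

def standardise_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trimmed, lower-cased, underscore-separated column names."""
    renamed = frame.copy()
    renamed.columns = (
        renamed.columns.str.strip()               # assumption: trim stray whitespace
        .str.lower()                              # assumption: lower-case everything
        .str.replace(" ", "_", regex=False)       # spaces become underscores
    )
    return renamed

# Hypothetical column names, chosen to show each rule.
raw = pd.DataFrame(columns=["Body Style", " Engine Size ", "price"])
print(list(standardise_columns(raw).columns))
```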
We start cleaning by dropping the columns that are not relevant to our purpose. As an example, we will remove the ‘aspiration’ column.
df = clean(df, method='dropcols', columns=['aspiration'])
This drops the specified column from the data frame. Next, we remove any duplicate rows from the dataset.
df = clean(df, method="duplicates")
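The two steps above, dropping a column and removing duplicate rows, correspond to standard pandas operations. The sketch below shows them on a hypothetical three-row frame; it illustrates the effect and makes no claim about how QuickDA performs it internally.

```python
import pandas as pd

# Hypothetical sample with one exact duplicate row.
sample = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "aspiration": ["std", "std", "turbo"],
    "price": [13950, 13950, 16430],
})

# Drop an unwanted column, then remove rows that are exact duplicates.
trimmed = sample.drop(columns=["aspiration"])
deduped = trimmed.drop_duplicates().reset_index(drop=True)
print(deduped)
```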
In the next step, we replace the junk value we saw when loading the dataset, ‘?’, with NaN.
df = clean(df, method="replaceval", columns=['normalized-losses'], to_replace="?", value=np.nan)
The ‘?’ values in the normalized-losses column are now NaN, as required. The next step is to drop any missing values in the dataset.
df = clean(df, method="dropmissing")
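The replace-then-drop sequence can also be written in plain pandas, which makes it clear why the placeholder must become NaN first: dropna only recognises real missing values, not the string ‘?’. This is a sketch on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; one row carries the '?' placeholder.
sample = pd.DataFrame({
    "make": ["audi", "bmw", "toyota"],
    "normalized-losses": ["164", "?", "158"],
})

# Swap the '?' placeholder for a real missing value in one column...
sample["normalized-losses"] = sample["normalized-losses"].replace("?", np.nan)
# ...then drop every row that still contains a missing value.
complete = sample.dropna().reset_index(drop=True)
print(complete)
```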
Dropping missing values removes the rows that contain NaN from the dataset. The next step is to check the data types of all the attributes and convert any that are incorrect.
Although the price values are numeric, the column's data type is object, so we need to convert it to a numeric data type.
df = clean(df, method='dtypes', columns='price', dtype='numeric')
This will convert the datatype of the price column to numeric. Similarly, we can change different columns to different data types.
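In plain pandas, the same kind of conversion is usually done with pd.to_numeric. The sketch below uses hypothetical price strings; errors="coerce" is one possible choice and turns any value that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical prices that were read from the file as text.
sample = pd.DataFrame({"price": ["13950", "16430", "5348"]})
print(sample["price"].dtype)  # a text dtype, because the values are strings

# errors="coerce" turns anything unparseable into NaN instead of raising.
sample["price"] = pd.to_numeric(sample["price"], errors="coerce")
print(sample["price"].dtype)  # now a numeric dtype
```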
- Data Visualisation
The third part is visualisation for univariate and bivariate analysis. Here we try to uncover different aspects of the dataset such as outliers, relationships and distributions.
We start with outlier analysis and distribution analysis, which are best performed with box plots and histograms respectively. The ‘eda_num’ function creates box plots and histograms of all numeric variables.
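It can be useful to know which numbers sit behind these two plots. The sketch below computes them for a hypothetical series: the quartiles and the conventional 1.5 × IQR fences that a box plot draws, and the per-bin counts that a histogram draws. The values and the bin count of 4 are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical city-mpg values; 60 is deliberately extreme.
mpg = pd.Series([19, 21, 24, 24, 26, 27, 31, 60], name="city-mpg")

# Box plot: quartiles plus the 1.5 * IQR fences that mark outliers.
q1, q3 = mpg.quantile(0.25), mpg.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = mpg[(mpg < lower_fence) | (mpg > upper_fence)]
print(flagged.tolist())

# Histogram: counts of values falling into equal-width bins.
counts, bin_edges = np.histogram(mpg, bins=4)
print(counts.tolist())
```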
Next, we remove the outliers from the dataset using the clean function.
df = clean(df, method='outliers')
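The rule QuickDA applies to decide what counts as an outlier is not described here. As an illustration only, the sketch below uses the common 1.5 × IQR rule; the helper drop_iqr_outliers and the sample data are hypothetical, and the results may differ from what the clean function produces.

```python
import pandas as pd

def drop_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose numeric values all lie within k * IQR of the quartiles."""
    numeric = frame.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    # Note: rows with NaN in a numeric column fail the comparison and are dropped too.
    within = (numeric >= q1 - k * iqr) & (numeric <= q3 + k * iqr)
    return frame[within.all(axis=1)]

# Hypothetical sample; the last row has an extreme city-mpg value.
sample = pd.DataFrame({
    "make": list("abcdefgh"),
    "city-mpg": [19, 21, 24, 24, 26, 27, 31, 60],
})
cleaned = drop_iqr_outliers(sample)
print(cleaned)
```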
After removing the outliers, let us visualise the correlation between the different numerical attributes by plotting them on a heatmap.
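A correlation heatmap is a coloured rendering of a correlation matrix, so the underlying values can be inspected directly in pandas. In the sketch below the sample frame is hypothetical, including the horsepower column, and the default Pearson correlation is assumed.

```python
import pandas as pd

# Hypothetical sample; highway-mpg is city-mpg plus a constant.
sample = pd.DataFrame({
    "city-mpg": [19, 21, 24, 26, 31],
    "highway-mpg": [25, 27, 30, 32, 37],
    "horsepower": [160, 140, 110, 100, 70],
    "make": ["a", "b", "c", "d", "e"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = sample.select_dtypes(include="number").corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate strong linear relationships, which is what the darkest and lightest cells of a heatmap highlight.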
Plotting categorical variables is also an important part of data visualisation, so next we plot one of the categorical variables as a bar chart.
Similarly, we can plot a bar chart for more than one variable in a single plot.
eda_cat(df, x='body-style', y='num-of-doors')
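The heights of the bars in such charts are category counts. As a sketch of what is being plotted, the counts for one variable and for a pair of variables can be computed in pandas as follows; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of two categorical columns.
sample = pd.DataFrame({
    "body-style": ["sedan", "sedan", "hatchback", "sedan", "hatchback", "wagon"],
    "num-of-doors": ["four", "two", "two", "four", "two", "four"],
})

# Single-variable bar chart: one count per category.
single = sample["body-style"].value_counts()
print(single)

# Two-variable bar chart: one count per combination of categories.
grouped = pd.crosstab(sample["body-style"], sample["num-of-doors"])
print(grouped)
```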
Beyond these, QuickDA can also draw other charts and plots. Some of them are given below.
- Scatter Plot
eda_numcat(df, x='city-mpg', y='highway-mpg',
- Violin Plots
eda_numcat(df, x='num-of-doors', y='city-mpg', method='comparison')
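A violin plot compares the distribution of a numeric variable across the groups of a categorical one. The same comparison can be read as a table of per-group statistics, sketched here in pandas on hypothetical data.

```python
import pandas as pd

# Hypothetical sample: fuel economy for two-door and four-door cars.
sample = pd.DataFrame({
    "num-of-doors": ["two", "two", "two", "four", "four", "four"],
    "city-mpg": [31, 24, 38, 21, 24, 27],
})

# Per-group count, mean, spread and quartiles of the numeric column.
per_group = sample.groupby("num-of-doors")["city-mpg"].describe()
print(per_group)
```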
In this article, we started with data analysis and saw how QuickDA's simple one-line commands save time and effort. We then moved to data cleaning, an integral part of data manipulation and EDA, which QuickDA makes straightforward. The final step was visualising the dataset and the relationships between its attributes, which usually takes time and effort but here needed just one line of code per plot. QuickDA is therefore a powerful tool for analysing, manipulating and visualising a dataset with little effort.