Tutorial For DataPrep – A Python Library to Prepare Your Data Before Training

In this article, we explore what DataPrep can do by walking through its main features.

Preparing your data before using it to train or test a machine learning model is essential for getting accurate results. It can also be a tiresome task, because analyzing the data and shaping it to our requirements takes a lot of time and effort.

DataPrep is an open-source Python library that lets you prepare your data with just a few lines of code. Here, preparing data means analyzing the properties of the attributes in the dataset. The current version of DataPrep includes a very useful module named EDA (Exploratory Data Analysis).



Implementation of DataPrep

Like any other Python library, DataPrep needs to be installed first, using pip install dataprep.
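Before moving on, it can help to confirm that the package is importable in the current environment. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of DataPrep.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# "dataprep" is the import name used later in this article.
if is_installed("dataprep"):
    print("dataprep is available")
else:
    print("dataprep is missing; run: pip install dataprep")
```

The check only looks the package up without importing it, so it is cheap to run at the top of a notebook.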

  1. Importing required libraries

DataPrep contains different functions for different operations. We will start by importing the plot function, which visualizes the statistical plots and properties of the dataset. We will also import Plotly Express, which we will use to load the dataset we will be working on.



import plotly.express as px

from dataprep.eda import plot

  2. Loading the Dataset

In this article, we will use the sample dataset named ‘tips’, which can be loaded through Plotly Express. The dataset contains attributes related to restaurant bills and tips.

df = px.data.tips() 

df

  3. Exploratory Data Analysis

We will start with statistical data exploration and analysis. The plot function prepares this statistical report, so a single line of code creates the whole analysis.

plot(df)

The output displays the statistical properties, frequency, and count of all the attributes. Clicking the ‘Show Stats Info’ button reveals the detailed statistical information.
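To make the numbers in the stats panel less of a black box, here is a rough sketch of the kind of per-column summary such a report contains, written with the standard library only. It is not DataPrep's implementation; the function names and the sample values are made up for illustration.

```python
from collections import Counter
import statistics


def numeric_summary(values):
    """Basic descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def categorical_summary(values):
    """Frequency of each distinct value in a categorical column."""
    return dict(Counter(values))


# Illustrative values only, not taken from the tips dataset.
sample_tips = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_days = ["Sun", "Sat", "Sun", "Thur", "Sun"]

print(numeric_summary(sample_tips))
print(categorical_summary(sample_days))
```

Numeric columns get location and spread measures, while categorical columns get frequency counts, which mirrors the split between histograms and bar charts in the report.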

We can also analyze individual attributes, which gives a clearer idea of each one. This view supports different plots such as the KDE plot and the box plot.

plot(df, 'tip')

By clicking the different plot tabs, we can visualize the ‘tip’ attribute in each plot type, for example as a box plot of the ‘tip’ column.
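A box plot compresses a column into its quartiles and flags points that lie far from the middle half of the data. The sketch below computes those quantities by hand, assuming the common 1.5 x IQR rule for the outlier fences. DataPrep may use a different quantile method, and the sample values are invented.

```python
import statistics


def box_plot_stats(values):
    """Quartiles, IQR, and 1.5 * IQR outlier fences for a numeric column."""
    # The "inclusive" method is an assumption; it treats the smallest and
    # largest observations as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Illustrative values only; 100 is placed far from the rest on purpose.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(box_plot_stats(sample))
```

Points outside the fences are the ones a box plot typically draws as individual markers beyond the whiskers.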

The next function we will import and use is plot_correlation, which creates a heatmap of the correlations between the attributes of the dataset. Heatmaps give a clear view of the relationships between attributes. DataPrep does not plot just one heatmap; it gives three variants: Pearson, Spearman, and Kendall Tau. Let us plot the correlation heatmaps.

from dataprep.eda import plot_correlation

plot_correlation(df)
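The three correlation variants differ in what kind of relationship they measure: Pearson captures linear association, while Spearman and Kendall Tau work on the ordering of the values. As a rough illustration (not DataPrep's code), the sketch below computes all three for two small made-up lists using only the standard library. It assumes there are no tied values, which keeps the ranking and pair-counting logic short.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall Tau: (concordant - discordant) pairs over all pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y), kendall_tau(x, y))
```

Because y always increases with x, the two rank-based measures report a perfect relationship, while Pearson comes out slightly lower since the relationship is not a straight line.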


DataPrep also lets us visualize missing data in our dataset. Finding missing data is an essential step in preparation, so that we can replace it with useful values. To visualize missing data we will use an advertising dataset that has some missing values, but any dataset with missing values will work.

This visualization is presented as 3 different plots.

import pandas as pd

from dataprep.eda import plot_missing

df1 = pd.read_csv('Advertising.csv')

plot_missing(df1)


These 3 plots help us visualize the missing data in the dataset, which in turn helps us prepare the data by removing the missing values or replacing them with relevant data.
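Once the missing values are located, the two options mentioned above are dropping the affected rows or filling the gaps. The sketch below shows both on a tiny made-up table in plain Python, using the column mean as one possible fill value. The function names, column names, and values are all hypothetical. In pandas, the analogous calls are DataFrame.dropna() and DataFrame.fillna().

```python
import statistics


def count_missing(rows):
    """Count None values per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def drop_missing(rows):
    """Keep only the rows that have no missing values."""
    return [row for row in rows if None not in row.values()]


def fill_with_mean(rows, column):
    """Return new rows where missing values in one column get the column mean."""
    present = [row[column] for row in rows if row[column] is not None]
    mean_value = statistics.mean(present)
    return [
        {**row, column: mean_value} if row[column] is None else dict(row)
        for row in rows
    ]


# Invented example table; None marks a missing value.
rows = [
    {"spend": 10.0, "sales": 1.0},
    {"spend": None, "sales": 2.0},
    {"spend": 30.0, "sales": None},
    {"spend": 20.0, "sales": 3.0},
]

print(count_missing(rows))
print(drop_missing(rows))
print(fill_with_mean(rows, "spend"))
```

Dropping rows is simplest but discards data, while filling keeps every row at the cost of introducing estimated values, so the right choice depends on how much is missing.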

This is how we can use DataPrep to prepare our data for further processing.

Conclusion:

In this article, we have seen how to use different functions of the EDA module of the DataPrep library to prepare our data for further processing. DataPrep is easy to use and saves time and effort, and its variety of functions lets you analyze many aspects of your data. According to the makers of DataPrep, more modules for data preparation will be released soon.


Himanshu Sharma
An aspiring Data Scientist currently Pursuing MBA in Applied Data Science, with an Interest in the financial markets. I have experience in Data Analytics, Data Visualization, Machine Learning, Creating Dashboards and Writing articles related to Data Science.
