Data preprocessing and normalization are important steps when feeding data to machine learning algorithms. The preprocessing phase can significantly affect a model's outcome, and many algorithms work best when all features are on a comparable scale. The preprocessing and normalization techniques required vary with the data. Data preprocessing can also be defined as a data mining technique that turns raw data gathered from multiple sources into cleaner information that is more suitable to work with. It is an essential preliminary step that takes the available information in the form of a dataset and applies various techniques to organize, sort, and merge it.
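To make the idea of putting features on a comparable scale concrete, here is a minimal sketch of two common normalization techniques, min-max scaling and standardization, written in plain Python. The feature names and values are hypothetical and only serve as an illustration.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features measured on very different scales.
trip_distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
fare_cents = [500, 900, 1300, 1700, 2100]

scaled_distance = min_max_scale(trip_distance_km)
scaled_fare = min_max_scale(fare_cents)
standardized_fare = standardize(fare_cents)
```

After scaling, both features span the same range, so neither dominates a distance-based or gradient-based model simply because of its units.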
Such data science techniques extract information from chunks of data to build a cleaner database from numerous datasets. These databases can grow very large and usually contain data of all sorts that does not share the same structure. Raw data can have missing or inconsistent values and carry a lot of redundant information. The purpose of the data preparation phase is to convert the data into the format best suited to machine learning. Data preparation comprises three main phases: data cleansing, data transformation, and feature engineering. High-quality data matters even more when working with complex algorithms, so this phase should not be skipped.
Through data preprocessing, we:
Eliminate incorrect or missing values caused by human error or bugs, making the data more accurate.
Remove inconsistencies and duplicates, which distort the results, making the data consistent.
Fill in missing attributes where needed, making the data complete.
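The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (`trip_id`, `distance_km`, `passengers`), and the choice of the median as a fill value are all hypothetical.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"trip_id": 1, "distance_km": 2.5, "passengers": 1},
    {"trip_id": 2, "distance_km": -4.0, "passengers": 2},    # impossible distance
    {"trip_id": 3, "distance_km": 7.1, "passengers": None},  # missing attribute
    {"trip_id": 3, "distance_km": 7.1, "passengers": None},  # exact duplicate
    {"trip_id": 4, "distance_km": None, "passengers": 3},    # missing distance
    {"trip_id": 5, "distance_km": 1.2, "passengers": 3},
]


def clean(rows):
    # 1. Drop rows whose distance is missing or not positive.
    valid = [r for r in rows if r["distance_km"] is not None and r["distance_km"] > 0]

    # 2. Remove duplicates, keeping the first occurrence of each trip_id.
    seen, unique = set(), []
    for r in valid:
        if r["trip_id"] not in seen:
            seen.add(r["trip_id"])
            unique.append(r)

    # 3. Fill missing passenger counts with the median of the known ones.
    known = [r["passengers"] for r in unique if r["passengers"] is not None]
    fill = median(known) if known else None
    return [
        {**r, "passengers": fill if r["passengers"] is None else r["passengers"]}
        for r in unique
    ]


cleaned_rows = clean(raw_rows)
```

Whether to drop a row or fill a value depends on the dataset; the sketch drops unusable distances but fills passenger counts only to show both options.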
Vaex is a Python library for out-of-core DataFrames that helps load, visualize, and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion rows per second. Visualizations can be created with histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. For performance, Vaex relies on memory mapping and a zero-memory-copy policy. Pandas is very popular for handling data in Python, but it is memory-hungry: it loads the whole dataset into RAM, so as the data grows you have to be careful not to hit a MemoryError.
Switching to a more powerful machine may solve some memory issues, but Pandas will still use only one of the 32 cores of your fancy machine. With Vaex, operations are out-of-core, executed in parallel, and lazily evaluated, which makes it possible to crunch through a billion-row dataset with little effort. Vaex can therefore address the problems above while still providing a convenient API. It achieves this performance by combining memory mapping, a zero-memory-copy policy, and lazy computation.
As discussed above, Vaex uses memory mapping to avoid loading data into RAM. All dataset files read into Vaex are memory-mapped, so when you open one, Vaex does not actually read the data. Instead, it quickly reads the file metadata, which lets it open files almost instantly regardless of how much RAM you have. Memory-mappable file formats include Apache Arrow and HDF5.
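Memory mapping itself is not specific to Vaex. The following sketch uses Python's standard `mmap` module to show the general idea, not how Vaex implements it internally: a file is mapped, and only the bytes that are actually touched get loaded. The file layout (one little-endian float64 per record) and the file name are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian float64 per record
N_RECORDS = 100_000

# Write a hypothetical binary column to disk: the values 0.0, 1.0, 2.0, ...
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for i in range(N_RECORDS):
        f.write(RECORD.pack(float(i)))


def read_record(path, index):
    """Return one float64 from the file without reading the whole file into memory."""
    with open(path, "rb") as f:
        # Mapping the file does not read it; pages are loaded when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            (value,) = RECORD.unpack_from(mapped, index * RECORD.size)
            return value


value = read_record(path, 12_345)
```

Because the operating system pages data in on demand, the cost of opening the mapping does not depend on the file size, which is the property the paragraph above describes.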
Getting Started with Code Implementation
This article explores how the Vaex library works and how it can make preprocessing and loading very large datasets easier than with traditional frameworks such as Pandas. The following code was inspired by the official Vaex documentation, whose link can be found here.
Installing the Library
First, we install the library. In a notebook, the following code can be run:
# installing the library (notebook cell)
!pip install --upgrade vaex
!pip install ipython==7.25.0
Note that 7.25.0 is a version of IPython, the interactive shell behind notebooks, not of Python itself. The second command pins IPython to the version used in this walkthrough.
Importing The Library
To import the library, the following code can be used:
import vaex as vx
Reading the Dataset
Now that we have imported the library, we can load the data into dataframes. We will use a very large dataset to get a taste of Vaex's processing power: it has 146 million rows and is over 12 GB in size. We load two files, taxi1 and taxi2 (the file names suggest trips from 2015 and from 2009 to 2015, respectively), and will compare and visualize the pickup locations recorded in each. All other details about the New York taxi dataset can be found through the link here.
# loading the first file into a dataframe
taxi1 = vx.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
# viewing rows and columns
taxi1

# viewing the shape of taxi1
taxi1.shape

# loading the second file
taxi2 = vx.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
# viewing rows and columns
taxi2
Visualizing the Dataset
Now that both files have been loaded into dataframes, we can visualize their pickup locations on a map.
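# creating the visualization for taxi1
# bounding box of the plotted area (longitude, latitude)
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90
# density of pickup locations, with log1p scaling of the counts
taxi1.plot(taxi1.pickup_longitude, taxi1.pickup_latitude, f="log1p",
           limits=[[long_min, long_max], [lat_min, lat_max]], show=True)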
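# creating the visualization for taxi2
# bounding box of the plotted area (longitude, latitude)
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90
# density of pickup locations, with log1p scaling of the counts
taxi2.plot(taxi2.pickup_longitude, taxi2.pickup_latitude, f="log1p",
           limits=[[long_min, long_max], [lat_min, lat_max]], show=True)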
As we can observe, Vaex lets us load and process big data and create powerful visualizations in a matter of seconds.
We can also benchmark the time required to open the dataset as a dataframe:
%%timeit
taxi1 = vx.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')
23.9 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Opening the file takes only milliseconds. This is consistent with memory mapping: Vaex reads the file metadata at this point rather than the data itself.
In this article, we explored the importance of data preprocessing and the capabilities of Vaex, a library that makes it easy to load very large datasets into dataframes. The above implementation can be found as a Colab notebook, using the link here.