How To Process Humongous Datasets Using Vaex?

Vaex is a Python library for out-of-core DataFrames that helps to load, visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count and standard deviation on an N-dimensional grid at up to a billion rows per second.

Data preprocessing and data normalization have become very important when it comes to feeding data through different Machine Learning algorithms. The preprocessing phase can significantly affect the learning model’s outcome; many algorithms, for instance, perform best when all features in the data are on the same scale, and the preprocessing and normalization techniques required vary with the data. Data preprocessing can also be defined as a data mining technique that turns raw data gathered from multiple sources into a cleaner information channel that is more suitable to work on. It is an essential preliminary step that takes all of the available information in a dataset and applies various techniques to organize, sort, and merge it.

Such Data Science techniques extract information from chunks of data to create a cleaner database from numerous datasets. These databases can get incredibly massive and usually contain data of all sorts, which means the records don’t share the same structure. Raw data can have missing or inconsistent values and a lot of redundant information. The purpose of the Data Preparation phase is to convert the data into a format best suited for Machine Learning. Data Preparation comprises three main phases: Data Cleansing, Data Transformation and Feature Engineering. High-quality data is essential when working with complex algorithms, so this is an incredibly important phase and should not be skipped by any means.

Why is Data Preprocessing so Important?

By preprocessing data:
We eliminate incorrect or missing values introduced by human error or bugs, making our databases more accurate.
We remove duplicates and inconsistencies that would otherwise affect the accuracy of results, making our data consistent.
We fill in missing attributes where needed, making the data complete.
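The three steps above can be sketched in plain Python. This is only a toy example with made-up rows; a real pipeline would do the same thing with a DataFrame library:

```python
# Toy preprocessing pipeline: drop rows with a missing required value,
# remove duplicates, and fill a remaining missing attribute with a default.
rows = [
    {"age": 25, "city": "NYC"},
    {"age": None, "city": "LA"},   # missing required value -> dropped
    {"age": 25, "city": "NYC"},    # exact duplicate -> dropped
    {"age": 30, "city": None},     # missing attribute -> filled
]

# 1. Eliminate rows where a required field is missing.
cleaned = [r for r in rows if r["age"] is not None]

# 2. Remove duplicates while preserving order.
seen, deduped = set(), []
for r in cleaned:
    key = (r["age"], r["city"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Fill remaining missing attributes with a default value.
for r in deduped:
    if r["city"] is None:
        r["city"] = "unknown"

print(deduped)  # two rows remain, both complete
```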

What is Vaex (python library)?

As introduced above, Vaex is a Python library for out-of-core DataFrames built to load, visualize and explore big tabular datasets, computing statistics such as mean, sum, count and standard deviation on an N-dimensional grid at up to a billion rows per second. Visualizations can be created using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. For best computational performance, Vaex uses a technique known as memory mapping together with a zero-memory-copy policy. While Pandas is hugely popular for handling data in Python, it is hungry for memory: as the data gets bigger, you have to be careful not to run into a MemoryError.
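To illustrate what a statistic "on an N-dimensional grid" means, here is a toy, pure-Python binned mean along one axis. Vaex performs this same kind of computation in optimized, parallel code over billions of rows; this sketch only shows the idea:

```python
# Toy binned statistic: the mean of y on a regular grid over x.
x = [0.5, 1.5, 1.7, 2.2, 2.9]
y = [10.0, 20.0, 30.0, 40.0, 50.0]

n_bins, x_min, x_max = 3, 0.0, 3.0
bin_width = (x_max - x_min) / n_bins

sums = [0.0] * n_bins
counts = [0] * n_bins
for xi, yi in zip(x, y):
    # Map each x value to its bin; clamp the right edge into the last bin.
    b = min(int((xi - x_min) / bin_width), n_bins - 1)
    sums[b] += yi
    counts[b] += 1

means = [s / c if c else None for s, c in zip(sums, counts)]
print(means)  # mean of y per x-bin -> [10.0, 25.0, 45.0]
```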

Switching to a more powerful machine may solve some memory issues, but Pandas will still use only one of the 32 cores of your fancy machine. With Vaex, all operations are out-of-core, executed in parallel and evaluated lazily, allowing you to crunch through a billion-row dataset effortlessly. Vaex is thus a potential solution to all of the above problems while still providing a convenient API, and it achieves this high performance by combining memory mapping, a zero-memory-copy policy and lazy computations.
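The lazy-evaluation idea can be sketched in plain Python: an expression object records what to compute and postpones the work until a result is actually requested. This is only a conceptual toy, not Vaex's actual expression machinery:

```python
class LazyExpr:
    """Records a computation and only runs it when evaluate() is called."""

    def __init__(self, func, description):
        self.func = func
        self.description = description

    def __add__(self, other):
        # Combining expressions builds a new description; no work is done yet.
        return LazyExpr(lambda: self.func() + other.func(),
                        f"({self.description} + {other.description})")

    def evaluate(self):
        return self.func()


a = LazyExpr(lambda: 2, "a")
b = LazyExpr(lambda: 3, "b")
expr = a + b             # nothing is computed here, only recorded
print(expr.description)  # -> (a + b)
print(expr.evaluate())   # -> 5, the work happens only now
```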

As discussed above, Vaex uses memory mapping to solve this. All dataset files read into Vaex are memory-mapped, so when you open a memory-mapped file with Vaex, no data is actually read. Instead, Vaex swiftly reads only the file metadata, which lets it open these files quickly irrespective of how much RAM you have. Memory-mappable file formats include Apache Arrow and HDF5.
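Memory mapping itself is available in Python's standard library via the mmap module. The sketch below (using a throwaway temporary file) maps a file and touches only a small slice of it; the operating system pages in just the bytes that are accessed, which is the mechanism Vaex builds on for HDF5 and Arrow files:

```python
import mmap
import os
import tempfile

# Write a 256 KiB file of repeating byte values 0..255 to map.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 1024)

# Memory-map the file: opening it reads no data up front.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm[0]         # only the touched page is read in
        chunk = mm[100:110]   # a 10-byte slice, not the whole file
        size = len(mm)        # the full size is known from metadata

print(first, list(chunk), size)
```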

Getting Started with Code Implementation

This article explores how the Vaex library works and how it can make loading and preprocessing humongous datasets easier compared to a traditional processing framework such as Pandas. The following code implementation is inspired by the official documentation of Vaex, whose link can be found here

Installing the Library

First, we will install the necessary libraries. To do so, the following code can be run:

#installing the library
!pip install --upgrade vaex
!pip install ipython==7.25.0

Please remember, this implementation pins IPython (not Python) to version 7.25.0; hence we are installing that as well.

Importing The Library

To import the library, the following code can be used:

import vaex as vx

Reading the Dataset

Now that we have imported the necessary libraries, we will load the dataset into a data frame. We will use a dataset that is tremendously huge to get a taste of Vaex’s processing power: the New York yellow taxi dataset, which has 146 million rows of data and is over 12GB in size. Here we will compare and visualize the pickup locations recorded in two data frames, taxi1 and taxi2. All other necessary details about the New York Taxi dataset can be found through the link here

#loading into dataframe
taxi1 = vx.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')

#viewing rows and columns
taxi1.head(5)

#view shape of taxi1
taxi1.shape
#loading the second dataframe
taxi2 = vx.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')

#viewing rows and columns
taxi2.head(5)
Visualizing the Dataset

Now that the data has been loaded into both data frames, we can start visualizing the pickup locations, plotted on a map of New York.

#creating visualization for taxi1
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90
taxi1.plot(taxi1.pickup_longitude, taxi1.pickup_latitude, f="log1p",
           limits=[[long_min, long_max], [lat_min, lat_max]], show=True);


#creating visualization for taxi2
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90
taxi2.plot(taxi2.pickup_longitude, taxi2.pickup_latitude, f="log1p",
           limits=[[long_min, long_max], [lat_min, lat_max]], show=True)

As we can observe, with Vaex, we can load and process Big Data and create powerful visualizations in a matter of seconds!

We can also benchmark the time required to load our dataset into the dataframe, here using IPython’s %timeit magic:

%timeit taxi1 = vx.open('s3://vaex/taxi/yellow_taxi_2015_f32s.hdf5?anon=true')

Output :

23.9 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The observed time is in milliseconds: because of memory mapping, opening the file reads only its metadata rather than the data itself. This tells us how fast and efficient the Vaex library really is.
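For context, the same style of benchmark can be reproduced outside IPython with the standard library's timeit module. The function timed below is only a stand-in for the actual Vaex open call, with made-up metadata:

```python
import timeit

def open_dataset():
    # Stand-in for a fast, metadata-only open such as vx.open(...);
    # the returned values here are purely illustrative.
    return {"rows": 146_000_000, "format": "hdf5"}

# Time 7 repeats of 1 loop each, mirroring %timeit's reporting style.
times = timeit.repeat(open_dataset, repeat=7, number=1)
best = min(times)
print(f"best of 7 runs: {best * 1e3:.3f} ms")
```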


In this article, we have discussed the importance of data preprocessing and explored the capabilities of a library named Vaex, which makes loading heavy Big Data into data frames easy. The above implementation can be found as a Colab notebook, using the link here.


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
