Hands-On Guide to Vaex – Tool to Overcome Drawbacks of Pandas


Himanshu Sharma

Pandas is an open-source data analysis and manipulation library built on Python. It is generally used for manipulating numerical and time-series data through data structures such as the DataFrame. Pandas is one of the most widely used Python libraries, but it has certain drawbacks: many of its operations are slow on bigger datasets, and it can only handle data that fits in memory, which fills up quickly.

To overcome these drawbacks of Pandas, let us explore Vaex, a high-performance Python library for lazy, out-of-core DataFrames that is used to visualize and manipulate big tabular datasets. It computes statistics and visualizations on very large datasets within seconds. Vaex uses lazy computation and memory mapping, so no memory is wasted on unnecessary copies, and it can open a dataset with billions of rows in a few seconds.
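Memory mapping can be illustrated without Vaex. The sketch below uses only Python's standard `mmap` and `struct` modules. The file layout (a flat run of 8-byte floats) and the helper names `write_doubles` and `read_double` are assumptions made for illustration, not Vaex's actual storage format.

```python
import mmap
import os
import struct
import tempfile


def write_doubles(path, n):
    """Write the values 0.0, 1.0, ..., n-1 as 8-byte little-endian floats."""
    with open(path, "wb") as f:
        for i in range(n):
            f.write(struct.pack("<d", float(i)))


def read_double(path, index):
    """Read one value through a memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * 8
            # Only the bytes that are sliced here need to be paged in.
            return struct.unpack("<d", mm[offset:offset + 8])[0]


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_doubles(path, 1000)
value = read_double(path, 750)
print(value)
```

Opening the map is cheap regardless of file size because the operating system loads pages only when they are touched, which is the general idea behind opening a very large file in very little time.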

In this article we will explore:



  1. How to use Vaex in Python
  2. Visualization using Vaex
  3. Comparing Vaex and Pandas

Implementation of Vaex in Python

Before we start exploring Vaex, we need to install it using pip install vaex

a. Importing libraries

We will import both the pandas and vaex libraries, as we need to compare the performance of both.

import vaex

import pandas as pd

b. Using Vaex

We will explore how to load a dataset in Vaex and perform different operations on it. The dataset we are using here is the NYC Motor Vehicle Collisions dataset, which is around 350 MB in size. We will load this dataset using Vaex.

df = vaex.open('motor_nyc.csv')

df.head(5)
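As far as I know, Vaex memory-maps binary formats such as HDF5, so a common pattern is to convert the CSV once and open the converted file afterwards. The sketch below assumes that `vaex.from_csv` accepts a `convert` argument naming the output file, which may differ between Vaex versions. The helper name `open_converted` and the `.hdf5` suffix are illustrative choices.

```python
import os


def open_converted(csv_path):
    """Convert a CSV to HDF5 on first use, then open the HDF5 file.

    Assumption: vaex.from_csv(..., convert=<path>) writes an HDF5 copy.
    Check the documentation of your installed Vaex version.
    """
    import vaex  # imported here so the helper can be defined without vaex

    hdf5_path = csv_path + ".hdf5"
    if not os.path.exists(hdf5_path):
        # One-time cost: the CSV is parsed and written out as HDF5.
        vaex.from_csv(csv_path, convert=hdf5_path)
    # Later calls skip the parsing step and open the binary file directly.
    return vaex.open(hdf5_path)
```

A call such as `df = open_converted('motor_nyc.csv')` would pay the conversion cost on the first run only.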

Basic Operations on the dataset:

%%time

df['NUMBER OF PEDESTRIANS KILLED'].mean()

%%time

df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts()


%%time

df['CROSS STREET NAME'].count()
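Aggregations like these do not require the whole table in memory at once. The following plain-Python sketch streams rows and keeps only running totals. It illustrates the out-of-core idea and is not how Vaex is implemented. The column names come from the dataset above, while the four sample rows and the function name `stream_stats` are made up for the example.

```python
import csv
import io
from collections import Counter

# Invented sample rows, used only to make the sketch runnable.
SAMPLE = """NUMBER OF PEDESTRIANS KILLED,CONTRIBUTING FACTOR VEHICLE 1
0,Driver Inattention/Distraction
1,Unspecified
0,Driver Inattention/Distraction
0,Failure to Yield Right-of-Way
"""


def stream_stats(lines, numeric_col, category_col):
    """Return (mean, value counts) while holding one row in memory at a time."""
    total = 0.0
    n = 0
    counts = Counter()
    for row in csv.DictReader(lines):
        total += float(row[numeric_col])
        n += 1
        counts[row[category_col]] += 1
    return total / n, counts


mean, counts = stream_stats(
    io.StringIO(SAMPLE),
    "NUMBER OF PEDESTRIANS KILLED",
    "CONTRIBUTING FACTOR VEHICLE 1",
)
print(mean)                   # 0.25
print(counts.most_common(1))  # [('Driver Inattention/Distraction', 2)]
```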

c. Visualization with Vaex

Now we will create some plots from the data frame loaded with Vaex and note the time taken using '%%time'.

%%time

df.plot1d(df['COLLISION_ID']);


%%time

df.plot(df['NUMBER OF PERSONS INJURED'], df['NUMBER OF PERSONS KILLED']);


Here we can see that, despite the large size of the dataset, Vaex did not take much time to create the plots.

Similarly, we can create several other plots and see that Vaex takes far less time than pandas.

Vaex has several other features like:

  1. It can read data from a large number of sources such as CSV, HDF5, Astropy tables, etc.
  2. It supports major types of visualization such as heatmaps and scatter plots.
  3. It supports common statistical functions such as variance and covariance.
  4. It is fast because it relies on lazy computation and a zero-memory-copy policy.
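Lazy computation means that an expression is recorded when it is written and evaluated only when a result is requested. The toy class below sketches that idea in plain Python. `LazyExpr` and `col` are hypothetical names, the three rows are invented, and none of this is Vaex's API.

```python
class LazyExpr:
    """A deferred per-row computation over named columns (illustration only)."""

    def __init__(self, func, text):
        self.func = func  # how to compute the value for one row
        self.text = text  # readable form of the recorded expression

    def __add__(self, other):
        # Combining expressions builds a new recipe; no data is touched.
        return LazyExpr(
            lambda row: self.func(row) + other.func(row),
            f"({self.text} + {other.text})",
        )

    def evaluate(self, rows):
        """Run the recorded expression over the rows, only when asked."""
        return [self.func(row) for row in rows]


def col(name):
    """Refer to a column by name without reading it."""
    return LazyExpr(lambda row: row[name], name)


rows = [
    {"injured": 2, "killed": 0},
    {"injured": 1, "killed": 1},
    {"injured": 0, "killed": 0},
]
casualties = col("injured") + col("killed")  # nothing is computed here
print(casualties.text)            # (injured + killed)
print(casualties.evaluate(rows))  # [2, 2, 0]
```

Because the sum is stored as a recipe rather than as a new column of values, defining it costs no extra memory, which is the sense in which a zero-copy policy avoids waste.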

d. Vaex vs. Pandas

Now let’s compare the time taken by pandas and vaex for different operations.

  1. Loading the same dataset

#Using Pandas

%%time

df = pd.read_csv('motor_nyc.csv')

#Using Vaex


%%time

df1 = vaex.open('motor_nyc.csv')


Here we can see that pandas took 18 seconds, while Vaex loaded the same dataset in 23 milliseconds.

  2. Performing statistical analysis

#Using Pandas

%%time

print(df['NUMBER OF PEDESTRIANS KILLED'].mean())

print(df['NUMBER OF PEDESTRIANS KILLED'].value_counts())

#Using Vaex

%%time

print(df1['NUMBER OF PEDESTRIANS KILLED'].mean())

print(df1['NUMBER OF PEDESTRIANS KILLED'].value_counts())


Here too we can see that Vaex is much faster than pandas.

Similarly, we can try other operations with both pandas and Vaex to compare their speed.
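Outside a notebook, where the '%%time' magic is not available, the same kind of comparison can be made with a small timing helper. The sketch below uses only the standard library. `best_time` is a hypothetical helper name, and taking the best of several runs is one reasonable convention, not the only one.

```python
import time


def best_time(func, repeats=5):
    """Return (best_seconds, last_result) over `repeats` calls of func()."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload; swap in any pandas or Vaex call wrapped in a lambda.
seconds, total = best_time(lambda: sum(range(100_000)))
print(f"{seconds:.6f} s, result={total}")
```

For example, `best_time(lambda: df['NUMBER OF PEDESTRIANS KILLED'].mean())` and the same call on `df1` would measure both libraries with the same method.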

Conclusion:

In this article we discussed:

  1. How we can use Vaex in Python for larger datasets.
  2. Visualization using a Vaex data frame.
  3. A comparison of pandas and Vaex, in which Vaex was considerably faster than pandas.