Pandas is an open-source data analysis and manipulation library built on Python. It is generally used for working with numerical and time-series data, and its central data structure is the DataFrame. Pandas is one of the most widely used Python libraries, but it has drawbacks: many of its operations become slow on larger datasets, and it can only handle data that fits in memory, which fills up quickly.
To overcome these drawbacks, let us explore Vaex, a high-performance Python library for lazy out-of-core DataFrames that is used to visualize and manipulate big tabular datasets. It can compute statistics and produce visualizations on very large datasets within seconds. Vaex uses lazy computation and memory mapping, so it avoids making unnecessary copies of the data in memory. It can open a dataset with billions of rows in a few seconds.
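To make the idea of memory mapping more concrete, here is a minimal sketch that uses only the Python standard library. It is not how Vaex is implemented; the file layout (raw 8-byte doubles) and the helper names `write_doubles` and `mapped_mean` are assumptions made purely for illustration. The point is that the file is mapped rather than read into a list, and the mean is accumulated one chunk at a time.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Write float values to `path` as raw 8-byte doubles (illustrative format)."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_mean(path, chunk_items=1024):
    """Mean of a non-empty raw file of doubles, read through a memory map."""
    with open(path, "rb") as f:
        # Map the file instead of reading it; pages are loaded by the OS on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as raw, raw.cast("d") as view:
                total = 0.0
                # Walk the mapped data chunk by chunk so no full copy is built.
                for start in range(0, len(view), chunk_items):
                    total += sum(view[start:start + chunk_items])
                return total / len(view)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_doubles(path, [float(i) for i in range(10_000)])
        print(mapped_mean(path))
```

Because the operating system pages data in as it is touched, the same pattern can in principle be applied to files larger than the available RAM.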
In this article we will explore:
- How to use Vaex in Python
- Visualization using Vaex
- Comparing Vaex and Pandas
Implementation of Vaex in Python
We will start exploring Vaex, but before that we need to install it using pip install vaex
- Importing libraries
We will import both pandas and vaex library as we need to compare the performance of both.
import pandas as pd
import vaex
- Using Vaex
We will explore how to load a dataset in Vaex and perform different operations on it. The dataset we are using here is the NYC Motor Vehicle Collisions dataset, which is around 350 MB. We will load this dataset using Vaex.
df = vaex.open('motor_nyc.csv')
Basic Operations on the dataset:
df['NUMBER OF PEDESTRIANS KILLED'].mean()
df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts()
df['CROSS STREET NAME'].count()
- Visualization with Vaex
Now we will create a plot from the data frame loaded with Vaex and measure the time it takes using the ‘%%time’ notebook magic.
df.plot(df['NUMBER OF PERSONS INJURED'], df['NUMBER OF PERSONS KILLED']);
Here we can see that, despite the dataset being large, Vaex did not take much time to create the plot.
Similarly, we can create several other plots and see that Vaex takes much less time than pandas.
Vaex has several other features like:
- It can read data from a large number of sources like CSV, HDF5, Astropy tables, etc.
- It supports all major types of visualization like heatmaps, scatter plots, etc.
- It supports common statistical functions like variance, covariance, etc.
- It is fast because it relies on lazy computation and a zero-memory-copy policy.
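The lazy computation mentioned in the list above can be illustrated with a small toy class. This is only a sketch of the general idea: `LazyColumn` is a hypothetical name and is not part of Vaex's API. Operations are recorded when they are requested and only run when a result such as the mean is actually needed.

```python
class LazyColumn:
    """Hypothetical stand-in for a lazily evaluated column (not Vaex's API)."""

    def __init__(self, source, ops=()):
        self._source = source   # zero-argument callable returning an iterable of numbers
        self._ops = tuple(ops)  # pending element-wise functions, not yet applied

    def map(self, func):
        # Only record the operation; no data is read or computed here.
        return LazyColumn(self._source, self._ops + (func,))

    def _values(self):
        # Stream values one at a time, applying every recorded operation.
        for value in self._source():
            for op in self._ops:
                value = op(value)
            yield value

    def mean(self):
        # The data source is touched for the first time here.
        total, count = 0.0, 0
        for value in self._values():
            total += value
            count += 1
        return total / count if count else float("nan")


if __name__ == "__main__":
    column = LazyColumn(lambda: range(1, 4)).map(lambda v: v * 2)
    print(column.mean())
```

Chaining `map` calls costs almost nothing, because the work is deferred until `mean()` streams over the data in a single pass without building intermediate copies.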
- Vaex vs. Pandas
Now let’s compare the time taken by pandas and vaex for different operations.
- Loading the same dataset
df = pd.read_csv('motor_nyc.csv')
df1 = vaex.open('motor_nyc.csv')
Here we can see that, while pandas took 18 seconds, Vaex loaded the same dataset in 23 milliseconds.
- Performing Statistical analysis
print(df['NUMBER OF PEDESTRIANS KILLED'].mean())
print(df['NUMBER OF PEDESTRIANS KILLED'].value_counts())
print(df1['NUMBER OF PEDESTRIANS KILLED'].mean())
print(df1['NUMBER OF PEDESTRIANS KILLED'].value_counts())
Here too, we can see that Vaex is much faster than pandas.
Similarly, we can try other operations with both pandas and Vaex and compare how long each one takes.
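The ‘%%time’ magic only works inside a notebook. For comparisons run from a plain Python script, a small helper such as the one below can be used instead. This is a minimal sketch; `time_call` is a hypothetical helper name, and the commented usage lines assume that pandas, Vaex and the motor_nyc.csv file from this article are available.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Call func(*args, **kwargs) `repeat` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    seconds, total = time_call(sum, range(1_000_000))
    print(f"sum took {seconds:.4f} s")

    # Possible use with the article's data (assumes the libraries and file exist):
    # pandas_seconds, df = time_call(pd.read_csv, "motor_nyc.csv", repeat=1)
    # vaex_seconds, df1 = time_call(vaex.open, "motor_nyc.csv", repeat=1)
```

Taking the best of several runs reduces the influence of one-off effects such as a cold disk cache, although for loading a file a single run may be the more honest measurement.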
In this article we discussed:
- How we can use Vaex in Python for larger datasets.
- Visualization using a Vaex data frame.
- We compared pandas and Vaex and found that Vaex was considerably faster than pandas for the operations we tried.