Hands-On Guide to Datatable Library For Faster EDA

When working with tabular data in data science, we typically reach for the Pandas library in Python. It is widely used for data exploration, analysis, munging and manipulation, the primary steps for understanding the data well and making it ready for a model to fit. The main disadvantage of pandas is that it becomes time-consuming when there is a large amount of data (big data).

Datatable overcomes this limitation of pandas and speeds up EDA (exploratory data analysis). Datatable is built by H2O.ai, one of the leading AI/ML companies in the world. It is quite similar to pandas and to R's data.table library, is properly documented, and works with Python 3.6+.

Advantages of Datatable

  • Supports null values, date-time and categorical types.
  • Efficient algorithms for sorting/grouping/joining.
  • Minimal data copying: filtering/sorting/grouping/joining operators use “rowindex” views to avoid unnecessary copies.
  • Faster data access than pandas.
  • Easy conversion to other data-processing frameworks.

In this article, I’ll be discussing the implementation of the datatable library with a large dataset.

Installing datatable

pip install datatable

Dataset: Credit Card Fraud Detection

The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. There are 31 columns in total: Time, Amount, Class and the anonymised features V1 to V28.

Reading files

With pandas:

import pandas as pd
import time
start = time.time()
pandas_df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
end = time.time()
print(end - start)

3.0183897018432617

With datatable:

import datatable as dt
import time
start = time.time()
df = dt.fread('../input/creditcardfraud/creditcard.csv')
end = time.time()
print(end - start)

0.30858349800109863

Clearly, datatable performs much better than pandas here: it reads the data in roughly 0.3 seconds, whereas pandas takes more than 3 seconds.

fread() -> datatable's parser for text (CSV) data (see the short sketch after this list):

  • Automatically detects column types, quoting rules, separators, headers, etc.
  • Reads from the file, raw text, archives, glob, URL, shell.
  • Provides multithreaded file reading facility.
  • Includes a progress indicator while reading large files.
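
As a quick illustration of fread()'s flexibility, here is a minimal sketch that parses raw CSV text passed directly as a string (the tiny inline table is just an illustrative example, not part of the fraud dataset):

# fread() also accepts raw CSV text, not just file paths
small = dt.fread("A,B,C\n1,2,3\n4,5,6\n")
print(small.shape)  # (2, 3)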

Dataset size:

print(df.shape) 

(284807, 31)

Feature Column names

print(df.names)

('Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class')

df.head() -> displays the first 10 rows of the dataset in a compact format.

Column Types

print(df.stypes) 

(stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.float64, stype.bool8)

Convert to numpy array

np_arr = df.to_numpy()

Convert to pandas 

df_pd = df.to_pandas()

Convert to python list object

py_obj = df.to_list()

Statistical functions

df.sum() - returns the sum of values in each column
df.max() - returns the maximum value of each column
df.min() - returns the minimum value of each column
df.mean() - returns the mean (average) of each column
df.sd() - returns the standard deviation of each column
df.mode() - returns the mode of each column
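
For example, here is a minimal sketch computing the mean, both for the whole Frame and for a single column via an expression (the Amount column comes from the fraud dataset loaded above):

print(df.mean())                    # one-row Frame with the mean of every column
print(df[:, dt.mean(dt.f.Amount)])  # mean of just the Amount column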

Sorting a Frame -> the sort() function sorts the rows by the specified column in ascending order.

With pandas:

start = time.time()
pandas_df.sort_values(by="Amount")
end = time.time()
print(end - start)

0.14470458030700684

With datatable:

start = time.time()
df.sort("Amount")
end = time.time()
print(end - start)

0.03646421432495117

GroupBy

Let us compute the total Amount for each value of the V1 feature. In datatable, operations on a Frame can be expressed as dt[i, j, ...], where i is the row selector, j is the column selector, and ... stands for additional modifiers such as by(). The notation is derived from matrix indexing.
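
Before timing the groupby itself, here is a minimal sketch of this notation (the row range and column choices are only illustrative):

first_rows = df[:5, ["Time", "Amount"]]                    # i = first five rows, j = two columns
by_class = df[:, dt.mean(dt.f.Amount), dt.by(dt.f.Class)]  # ... = a by() grouping modifier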

With pandas:

start = time.time()
for i in range(500):
    pandas_df.groupby("V1")["Amount"].sum()
end = time.time()
print(end - start)

48.4233283996582

With datatable:

start = time.time()
for i in range(10):
    df[:, dt.sum(dt.f.Amount), dt.by(dt.f.V1)]
end = time.time()
print(end - start)

16.062947988510132

The f in dt.f is the frame proxy: it refers to the frame currently being operated on, so dt.f.Amount means the Amount column of that frame.

On the runs above, the datatable expression finishes in about 16 seconds versus roughly 48 seconds for pandas (note, though, that the pandas loop runs 500 iterations while the datatable loop runs only 10).
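
The f proxy is not limited to groupby expressions. Here is a minimal sketch of using it for row filtering and column selection (the 1000 threshold is just an illustrative value):

high_value = df[dt.f.Amount > 1000, :]  # rows where Amount exceeds 1000
amounts = df[:, dt.f.Amount]            # select a single column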

Appending rows/columns to Frames:

cbind() binds columns and rbind() binds rows to an existing Frame.

df.cbind(col1, col2)             # col1, col2: Frames whose columns are attached side by side
df.rbind(rowname, force=True)    # rowname: a Frame whose rows are appended; force=True relaxes schema checks
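
For a concrete illustration, here is a minimal sketch using small made-up Frames (the Flag column and the zero values are hypothetical, not part of the fraud dataset):

extra_col = dt.Frame(Flag=[0] * df.nrows)              # one new column with a matching row count
df.cbind(extra_col)                                    # attach it as an extra column

one_row = dt.Frame({name: [0] for name in df.names})   # a single row matching df's columns
df.rbind(one_row, force=True)                          # append it; force=True relaxes type checks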

Deleting a column

del df[:, 'V27']

Saving Frames

A Frame can be saved to disk in a binary (.jay) format and reopened later almost instantly:

df.to_jay("out.jay")
df_dt = dt.open("out.jay")

Write the Frame

df.to_csv("out.csv")

Conclusion

Datatable has a syntax similar to pandas and delivers noticeably better performance for most operations. It handles almost all common text and tabular data formats.

It is highly efficient when working with large amounts of data (datasets of up to 100 GB held in RAM). Development on datatable remains active, since it still lacks some of the advanced functionality available in pandas.

The complete code for the above implementation is available in AIM's GitHub repository. Please visit this link to find the notebook for this code.


Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn about new technology advances. A self-taught techie who loves to do cool stuff using technology, for fun and for worthwhile causes.
