Creating Deep Learning Models For Tabular Data using RTDL

RTDL (Revisiting Tabular Deep Learning) is an open-source Python package that implements the models from the paper “Revisiting Deep Learning Models for Tabular Data”. The library makes it straightforward to build deep learning models for tabular data and is aimed at practitioners and programmers who want to apply deep learning to such datasets.

The development of better deep learning models in recent years, and their ability to extract relevant information from many kinds of data, has opened up new possibilities: algorithms can now be trained to identify decisive patterns and even surface clinical findings that general practitioners might otherwise miss. Research in this area has only recently started to appear, but it has been attracting a great deal of attention and has delivered results that were previously thought impossible. Deep learning can be defined as a machine learning technique that teaches computers to learn by example, much as humans do. It has been a key technology behind self-driving cars, enabling them to recognize a stop sign or to distinguish a pedestrian from a lamppost, and it powers many voice-control features in consumer devices such as phones, tablets and smart TVs. In deep learning, the model learns to perform classification tasks directly from images, text or sound data.

These models can achieve state-of-the-art accuracy, sometimes even exceeding human-level performance. They are trained on large sets of labelled data, and their neural network architectures comprise many processing layers, which is what makes the network “deep”. The more labelled data available, the higher the recognition and classification accuracy that can be reached. Building a driverless-car model, for example, requires millions of images and thousands of hours of video for training. Deep learning also demands substantial computing power: well-structured models combined with high-performance GPUs make for a more efficient deep learning setup, and when paired with clusters or cloud computing they can reduce the training time of a network to hours or less. Training iterations continue until the output reaches an acceptable level of accuracy.

Deep learning techniques can eliminate some of the data pre-processing typically required by traditional machine learning. The input and output layers of a deep neural network are known as the visible layers: the input layer is where the model ingests the data to be processed, and the output layer is where the final prediction or classification is produced. Real-world deep learning applications are part of our daily lives; in most cases they are so well integrated into products and services that users are often unaware of the complex data processing taking place in the background. With automatic feature engineering and self-learning capabilities, these algorithms need little to no human intervention, which hints at the huge potential of deep learning and helps spark further ideas.

What is RTDL?

As introduced above, RTDL is an open-source Python package implementing the paper “Revisiting Deep Learning Models for Tabular Data”. Beyond making it easy to build deep learning models for tabular data, it can also serve as a source of comparative baselines for researchers working with other, more traditional libraries. Given its performance and simplicity, it is a useful starting point for future work on tabular DL. Its flagship design is based on attention-based Transformer architectures.
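
Besides the FT-Transformer used later in this article, the library also exposes simple MLP and ResNet baselines from the same paper. The sketch below illustrates the three model families; the hyperparameters are hypothetical and the exact constructor signatures may differ between rtdl versions, so treat it as a rough guide rather than the definitive API.

#Sketch: the three model families exposed by rtdl (hypothetical hyperparameters)
import rtdl

mlp = rtdl.MLP.make_baseline(d_in=8, d_layers=[128, 128], dropout=0.1, d_out=1)
resnet = rtdl.ResNet.make_baseline(
    d_in=8, n_blocks=2, d_main=128, d_hidden=256,
    dropout_first=0.25, dropout_second=0.0, d_out=1,
)
ft_transformer = rtdl.FTTransformer.make_default(
    n_num_features=8, cat_cardinalities=None, d_out=1
)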

Getting Started With Code

In this article, we will implement a basic deep learning model on tabular data using RTDL and evaluate it with the RMSE score, which measures the typical size of the prediction error in the same units as the target. The following implementation is partially inspired by the official example from the creators of RTDL, which can be accessed using the link here.
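
As a quick refresher, RMSE is simply the square root of the mean squared error between predictions and targets. The minimal sketch below uses hypothetical values and plain NumPy (not otherwise needed in this article) just to show the calculation.

#Minimal RMSE sketch with hypothetical values
import numpy as np

y_true = np.array([2.5, 1.0, 3.2])   # hypothetical targets
y_pred = np.array([2.3, 1.4, 3.0])   # hypothetical predictions
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)   # error expressed in the same units as the target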

Installing the Library

To start with our model creation, we will first be installing the required libraries. The following lines can be run to do so,

#Installing required libraries
!pip install rtdl
!pip install libzero==0.0.4

We also install the libzero package here, which provides zero, a zero-overhead utility library for PyTorch that we import below.

Importing Dependencies

We will now be importing the further dependencies required to work with the RTDL library,

#importing the dependencies
import rtdl
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing
import torch
import torch.nn as nn
import torch.nn.functional as F
import zero

Loading and Processing the Data

Next, we will load the data to be processed. We will use the California housing dataset, readily available in sklearn, which contains housing data drawn from the 1990 U.S. Census. We will split it into train, validation and test sets and also preprocess the features.

#setting the device and importing the dataset
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = sklearn.datasets.fetch_california_housing()
X_all = dataset['data'].astype('float32')
y_all = dataset['target'].astype('float32')
X = {}
y = {}

#splitting into train test
X['train'], X['test'], y['train'], y['test'] = sklearn.model_selection.train_test_split(
    X_all, y_all, train_size=0.8
)
 
#for validation
X['train'], X['val'], y['train'], y['val'] = sklearn.model_selection.train_test_split(
    X['train'], y['train'], train_size=0.8
)
 
# Preprocess features: the scaler is fit on the training split only, then applied to every split
preprocess = sklearn.preprocessing.StandardScaler().fit(X['train'])
X = {
    k: torch.tensor(preprocess.transform(v), device=device)
    for k, v in X.items()
}
 
# standardizing the regression target using the train mean and std
y_mean = float(y['train'].mean())
y_std = float(y['train'].std())
y = {
    k: torch.tensor((v - y_mean) / y_std, device=device)
    for k, v in y.items()
}
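
Before moving on, it can be useful to glance at what was loaded; the quick sanity check below is an optional addition, not part of the original walkthrough.

#Optional sanity check on the loaded data
print(dataset['feature_names'])   # the 8 numerical features, e.g. MedInc, HouseAge, ...
print(X_all.shape, y_all.shape)   # roughly (20640, 8) and (20640,)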

We will now create the model itself: FT-Transformer (Feature Tokenizer + Transformer), the attention-based architecture proposed in the paper. We use the library’s default configuration, which is intended as a sensible baseline.

#Applying Feature Transformer
model = rtdl.FTTransformer.make_default(
    n_num_features=X_all.shape[1],
    cat_cardinalities=None,
    last_layer_query_idx=[-1], 
    d_out=1,
)
 
#setting up the optimizer
model.to(device)
 
# lr and weight_decay are only used by the fallback branch for non-FT-Transformer models
lr = 0.001
weight_decay = 0.0
optimizer = (
    model.make_default_optimizer()
    if isinstance(model, rtdl.FTTransformer)
    else torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
)
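
The California housing data contains only numerical features, which is why cat_cardinalities is set to None above. For a dataset that also has categorical columns, the model would be given the list of category counts and called with a second input. The sketch below uses hypothetical shapes (6 numerical features, 2 categorical columns with 3 and 10 categories) and is only meant to show the calling convention.

#Sketch: FT-Transformer with both numerical and categorical inputs (hypothetical data)
model_with_cat = rtdl.FTTransformer.make_default(
    n_num_features=6,
    cat_cardinalities=[3, 10],
    last_layer_query_idx=[-1],
    d_out=1,
)
x_num = torch.randn(32, 6)                    # batch of 32 rows of numerical features
x_cat = torch.randint(0, 3, (32, 2))          # categorical indices for the two columns
x_cat[:, 1] = torch.randint(0, 10, (32,))     # second column has 10 categories
out = model_with_cat(x_num, x_cat)            # output shape: (32, 1)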

Final Model Setup

Finally, let us set up the helper functions that apply the model and compute the RMSE metric.

#applying the model
def apply_model(x_num, x_cat=None):
    # rtdl.FTTransformer expects two inputs: x_num and x_cat
    return model(x_num, x_cat) if isinstance(model, rtdl.FTTransformer) else model(x_num)
 
 
@torch.no_grad()
def evaluate(part):
    model.eval()
    # calculating RMSE on the original target scale (multiply by y_std to undo the standardization)
    mse = F.mse_loss(apply_model(X[part]).squeeze(1), y[part]).item()
    rmse = mse ** 0.5 * y_std
    return rmse
 
#Setting the batch size
batch_size = 256
train_loader = zero.data.IndexLoader(len(X['train']), batch_size, device=device)


progress = zero.ProgressTracker(patience=100)
 
print(f'Test RMSE before training: {evaluate("test"):.4f}')

Calculating RMSE & Best Validation Epoch

We now train the model, evaluating the validation and test RMSE after every epoch and flagging the epochs where the validation RMSE reaches a new best, to see how well the model can perform.

#setting epoch size
n_epochs = 50
for epoch in range(1, n_epochs + 1):
    for batch_idx in train_loader:
        model.train()
        optimizer.zero_grad()
        x_batch = X['train'][batch_idx]
        y_batch = y['train'][batch_idx]
        F.mse_loss(apply_model(x_batch).squeeze(1), y_batch).backward()
        optimizer.step()
 
    val_rmse = evaluate('val')
    test_rmse = evaluate('test')
    print(f'Epoch {epoch:03d} | Validation RMSE: {val_rmse:.4f} | Test RMSE: {test_rmse:.4f}', end='')
    progress.update(-val_rmse)  # ProgressTracker maximizes its score, so we pass the negated RMSE
    if progress.success:
        print(' <<< BEST VALIDATION EPOCH', end='')
    print()
    if progress.fail:
        break

Output:

Epoch 001 | Validation RMSE: 0.8441 | Test RMSE: 0.5852
Epoch 002 | Validation RMSE: 0.8252 | Test RMSE: 0.5731
Epoch 003 | Validation RMSE: 0.8061 | Test RMSE: 0.5749 <<< BEST VALIDATION EPOCH
Epoch 004 | Validation RMSE: 0.7962 | Test RMSE: 0.5685 <<< BEST VALIDATION EPOCH
Epoch 005 | Validation RMSE: 0.8068 | Test RMSE: 0.5722
Epoch 006 | Validation RMSE: 0.8254 | Test RMSE: 0.5689
Epoch 007 | Validation RMSE: 0.7939 | Test RMSE: 0.5705 <<< BEST VALIDATION EPOCH
Epoch 008 | Validation RMSE: 0.7914 | Test RMSE: 0.5724 <<< BEST VALIDATION EPOCH
Epoch 009 | Validation RMSE: 0.8042 | Test RMSE: 0.5596
Epoch 010 | Validation RMSE: 0.8531 | Test RMSE: 0.5759
Epoch 011 | Validation RMSE: 0.7862 | Test RMSE: 0.5678 <<< BEST VALIDATION EPOCH
Epoch 012 | Validation RMSE: 0.7942 | Test RMSE: 0.5651
Epoch 013 | Validation RMSE: 0.7868 | Test RMSE: 0.5715
Epoch 014 | Validation RMSE: 0.7719 | Test RMSE: 0.5812 <<< BEST VALIDATION EPOCH
Epoch 015 | Validation RMSE: 0.7440 | Test RMSE: 0.5695 <<< BEST VALIDATION EPOCH
Epoch 016 | Validation RMSE: 0.7833 | Test RMSE: 0.5657
Epoch 017 | Validation RMSE: 0.8052 | Test RMSE: 0.5711
Epoch 018 | Validation RMSE: 0.7634 | Test RMSE: 0.5750
Epoch 019 | Validation RMSE: 0.7330 | Test RMSE: 0.5661 <<< BEST VALIDATION EPOCH
Epoch 020 | Validation RMSE: 0.7520 | Test RMSE: 0.5582
Epoch 021 | Validation RMSE: 0.8038 | Test RMSE: 0.5611
Epoch 022 | Validation RMSE: 0.7813 | Test RMSE: 0.5636
Epoch 023 | Validation RMSE: 0.7614 | Test RMSE: 0.5764
Epoch 024 | Validation RMSE: 0.7748 | Test RMSE: 0.5704
Epoch 025 | Validation RMSE: 0.7430 | Test RMSE: 0.5589
Epoch 026 | Validation RMSE: 0.7686 | Test RMSE: 0.5487
Epoch 027 | Validation RMSE: 0.7350 | Test RMSE: 0.5523
Epoch 028 | Validation RMSE: 0.7862 | Test RMSE: 0.5596
Epoch 029 | Validation RMSE: 0.7472 | Test RMSE: 0.5727
Epoch 030 | Validation RMSE: 0.7427 | Test RMSE: 0.5603
Epoch 031 | Validation RMSE: 0.7618 | Test RMSE: 0.5583
Epoch 032 | Validation RMSE: 0.7394 | Test RMSE: 0.5573
Epoch 033 | Validation RMSE: 0.7671 | Test RMSE: 0.5607
Epoch 034 | Validation RMSE: 0.7604 | Test RMSE: 0.5633
Epoch 035 | Validation RMSE: 0.7439 | Test RMSE: 0.5540
Epoch 036 | Validation RMSE: 0.7596 | Test RMSE: 0.5533
Epoch 037 | Validation RMSE: 0.7731 | Test RMSE: 0.5621
Epoch 038 | Validation RMSE: 0.7589 | Test RMSE: 0.5584
Epoch 039 | Validation RMSE: 0.7883 | Test RMSE: 0.5617
Epoch 040 | Validation RMSE: 0.7690 | Test RMSE: 0.5644
Epoch 041 | Validation RMSE: 0.7461 | Test RMSE: 0.5623
Epoch 042 | Validation RMSE: 0.7671 | Test RMSE: 0.5659
Epoch 043 | Validation RMSE: 0.7668 | Test RMSE: 0.5668
Epoch 044 | Validation RMSE: 0.7702 | Test RMSE: 0.5544
Epoch 045 | Validation RMSE: 0.7772 | Test RMSE: 0.5570
Epoch 046 | Validation RMSE: 0.7692 | Test RMSE: 0.5698
Epoch 047 | Validation RMSE: 0.7707 | Test RMSE: 0.5696
Epoch 048 | Validation RMSE: 0.7631 | Test RMSE: 0.5704
Epoch 049 | Validation RMSE: 0.7253 | Test RMSE: 0.5638 <<< BEST VALIDATION EPOCH
Epoch 050 | Validation RMSE: 0.7693 | Test RMSE: 0.5623
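
Once training stops, the trained model can be used for inference. Because the targets were standardized earlier, predictions need to be mapped back to the original scale with the y_mean and y_std computed above; below is a minimal sketch reusing the apply_model helper.

#Sketch: predictions on the test split, converted back to the original target scale
model.eval()
with torch.no_grad():
    preds_std = apply_model(X['test']).squeeze(1)   # standardized predictions
    preds = preds_std * y_std + y_mean              # back to the original units
print(preds[:5])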

End Notes

In this article, we looked at what deep learning models are, why they matter and how they are being used. We also explored the RTDL library and implemented a basic deep learning model for tabular data with it. The full implementation above is available as a Colab notebook and can be accessed using the link here.

Happy Learning!

Victor Dey
Victor is an aspiring Data Scientist and holds a Master of Science in Data Science & Big Data Analytics. He is a researcher, a data science influencer and a former university football player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
