How to Handle Tabular Data for Deep Learning Using PyTorch Tabular?

PyTorch Tabular is a framework for deep learning on tabular data that aims to make it simple and accessible for both real-world applications and academic research.

On tabular data, deep learning has traditionally lagged behind gradient boosting in both popularity and performance. However, newer models developed expressly for tabular data, such as XBNet, have recently raised the performance bar. In this post, we will look at PyTorch Tabular, a framework designed to make deep learning on tabular data easy and accessible for real-world use cases. We will discuss how the framework is built, what design principles it follows, and how it can be applied. The major points covered in this article are listed below.

Table of Contents

  1. The PyTorch Tabular
  2. Design of Library
  3. Implementing PyTorch Tabular

Let’s start by looking at what the framework is and why it was created.

The PyTorch Tabular

PyTorch Tabular is a framework for deep learning on tabular data that aims to make it simple and accessible for both real-world applications and academic research. The following are the design principles for the library:

  • Low-resistance usability
  • Easy customization
  • Scalable and easier to deploy

PyTorch Tabular aims to make the software engineering side of working with neural networks as straightforward and painless as possible, so that you can focus on the model. It also aims to bring together the many recent advances in deep learning for tabular data into a single framework, with a common API that works across a variety of cutting-edge models. Finally, it comes with a base model that can be readily customized, which helps deep learning researchers create new architectures for tabular data.

PyTorch Tabular stands on the shoulders of giants such as PyTorch, PyTorch Lightning, and Pandas.


Design of Library

PyTorch Tabular is intended to make the standard modeling pipeline simple enough for practitioners while also being reliable enough for production use. It also focuses on customization so that it can be used in a variety of research settings. The figure below depicts the structure of the framework.


Now let’s briefly discuss each module of the framework, starting with the configuration modules.

Data Config

DataConfig is where we define the parameters for how we will manage data within the pipeline. This configuration differentiates between categorical and continuous features, determines normalization or feature transformations, and so on.

Model Config

For each model implemented in PyTorch Tabular, a new ModelConfig is defined. It derives from a base ModelConfig that contains common parameters such as the task (classification or regression), learning rate, loss, metrics, and so forth. Each model's config inherits these parameters and adds model-specific hyperparameters. PyTorch Tabular then initializes the correct model automatically, based on which ModelConfig it is given.
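
The inheritance and automatic model selection described above can be pictured with a small sketch. Everything here (BaseModelConfig, MLPConfig, MLPModel, MODEL_REGISTRY, build_model) is a hypothetical name invented for illustration; this is not PyTorch Tabular's actual implementation, only one plausible way such a lookup could work.

```python
from dataclasses import dataclass

# Hypothetical names throughout: these classes only mimic the pattern
# described in the text and are not PyTorch Tabular's real classes.


@dataclass
class BaseModelConfig:
    task: str                    # "classification" or "regression"
    learning_rate: float = 1e-3  # shared by every model


@dataclass
class MLPConfig(BaseModelConfig):
    layers: str = "128-64"       # model-specific hyperparameter
    activation: str = "ReLU"     # model-specific hyperparameter


class MLPModel:
    def __init__(self, config: MLPConfig):
        self.config = config
        # Parse "128-64" into [128, 64] hidden layer sizes.
        self.hidden_sizes = [int(n) for n in config.layers.split("-")]


# Registry that maps a config type to the model class it configures.
MODEL_REGISTRY = {MLPConfig: MLPModel}


def build_model(config: BaseModelConfig):
    """Pick the model class that matches the type of the config."""
    model_cls = MODEL_REGISTRY[type(config)]
    return model_cls(config)


model = build_model(MLPConfig(task="classification", layers="64-32"))
print(type(model).__name__, model.hidden_sizes, model.config.learning_rate)
```

The caller only ever constructs a config object; which model class gets built follows from the config's type.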

Trainer Config

TrainerConfig manages all of the parameters that control training, most of which are passed down to the underlying PyTorch Lightning layer. Batch size, maximum number of epochs, early stopping, and other parameters can be set here.
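
To make the early stopping setting more concrete, here is a minimal, library-free sketch of the usual patience-based rule. The EarlyStopping class, the run_training helper, and the loss values are all made up for illustration; in practice this logic is handled by the training layer, not written by hand.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, max_epochs, patience):
    """Walk through precomputed validation losses; return (epochs run, best loss)."""
    stopper = EarlyStopping(patience=patience)
    epochs_run = 0
    for epoch in range(min(max_epochs, len(val_losses))):
        epochs_run = epoch + 1
        if stopper.should_stop(val_losses[epoch]):
            break
    return epochs_run, stopper.best


# Made-up validation losses standing in for a real training run.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50, 0.40]
epochs_run, best = run_training(losses, max_epochs=100, patience=3)
print(epochs_run, best)
```

With a patience of 3, training halts after three consecutive epochs without improvement, even though max_epochs would allow it to continue.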

Optimizer Config

Optimizers and learning rate schedules are another important aspect of training a neural network. Both are set through OptimizerConfig.
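
As a rough illustration of how an optimizer and a learning rate schedule interact, the sketch below applies plain gradient descent with a step-decay schedule to a one-dimensional toy problem. The function names and the toy objective are assumptions made for this example and have nothing to do with the library's internals.

```python
def step_decay(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.1) -> float:
    """Multiply the learning rate by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def sgd_minimize(grad_fn, w, base_lr, epochs, step_size, gamma):
    """Plain gradient descent whose learning rate follows the schedule."""
    for epoch in range(epochs):
        lr = step_decay(base_lr, epoch, step_size, gamma)
        w = w - lr * grad_fn(w)
    return w


# Learning rate for the first six epochs, halved every two epochs.
schedule = [step_decay(1e-3, e, step_size=2, gamma=0.5) for e in range(6)]
print(schedule)

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = sgd_minimize(lambda w: 2 * (w - 3.0), w=0.0,
                       base_lr=0.1, epochs=50, step_size=20, gamma=0.5)
print(round(w_final, 3))
```

The optimizer decides how a gradient becomes a parameter update, while the schedule decides how large that update is allowed to be at each point in training.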

Experiment Config

Experiment tracking is practically a requirement in machine learning, and it is crucial for reproducibility. PyTorch Tabular recognizes this and provides experiment tracking out of the box. TensorBoard and Weights & Biases are the two experiment tracking frameworks that it currently supports.

Base Model

PyTorch Tabular makes use of the abstract BaseModel class, which implements the standard parts of any model definition, such as loss and metric calculation, and so on. This class acts as a foundation for any other model and guarantees that the model and the training engine work together seamlessly. The model initialization component and the forward pass are the only two methods that a new model must implement if it inherits this class.
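
The division of labor described above can be sketched in plain Python. SketchBaseModel and LinearModel are hypothetical classes written for this article, and the method names are assumptions; the real BaseModel builds on PyTorch Lightning and looks different, but the idea of "shared logic in the base class, two methods in the subclass" is the same.

```python
from abc import ABC, abstractmethod


class SketchBaseModel(ABC):
    """Hypothetical stand-in for the pattern: shared logic lives here."""

    def __init__(self, n_features: int):
        self.n_features = n_features
        self._build_network()  # subclass hook 1: model initialization

    @abstractmethod
    def _build_network(self) -> None:
        """Create the model's parameters."""

    @abstractmethod
    def forward(self, row):
        """Subclass hook 2: map one row of features to a prediction."""

    # Standard part implemented once for every model.
    def loss(self, rows, targets) -> float:
        """Mean squared error over a batch."""
        errors = [(self.forward(r) - t) ** 2 for r, t in zip(rows, targets)]
        return sum(errors) / len(errors)


class LinearModel(SketchBaseModel):
    def _build_network(self) -> None:
        self.weights = [0.5] * self.n_features
        self.bias = 0.0

    def forward(self, row):
        return sum(w * x for w, x in zip(self.weights, row)) + self.bias


model = LinearModel(n_features=2)
batch_loss = model.loss(rows=[[1.0, 1.0], [2.0, 0.0]], targets=[1.0, 2.0])
print(batch_loss)
```

Because the two hooks are abstract, a subclass that forgets to implement either one cannot be instantiated, which is what keeps new models compatible with the shared training code.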

Data Module

PyTorch Tabular uses the Data Module, as defined by PyTorch Lightning, to unify and standardize data processing. It covers preprocessing, label encoding, categorical encoding, feature transformations, target transformations, and other data processing steps. It also ensures that the same processing is applied to the training and validation splits as well as to new, unseen data. PyTorch data loaders are provided for training and inference.
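
The key point, that statistics are learned once on the training data and then reused unchanged on every other split, can be shown with a small standard-library sketch. SketchPreprocessor and the sample rows are invented for illustration, and the choice to map unseen categories to index 0 is an assumption of this sketch rather than a statement about the library.

```python
from statistics import mean, pstdev


class SketchPreprocessor:
    """Learns statistics on training rows only, then reuses them everywhere."""

    def fit(self, rows, continuous_col, categorical_col):
        self.continuous_col = continuous_col
        self.categorical_col = categorical_col
        values = [r[continuous_col] for r in rows]
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero
        categories = sorted({r[categorical_col] for r in rows})
        # Index 0 is reserved for categories never seen during fit.
        self.category_index_ = {c: i + 1 for i, c in enumerate(categories)}
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            scaled = (r[self.continuous_col] - self.mean_) / self.std_
            code = self.category_index_.get(r[self.categorical_col], 0)
            out.append({self.continuous_col: scaled, self.categorical_col: code})
        return out


train_rows = [{"age": 20.0, "city": "Pune"}, {"age": 40.0, "city": "Delhi"}]
new_rows = [{"age": 30.0, "city": "Mumbai"}]

prep = SketchPreprocessor().fit(train_rows, "age", "city")
train_out = prep.transform(train_rows)
new_out = prep.transform(new_rows)  # uses the training mean and std
print(train_out)
print(new_out)
```

Calling fit on the new rows instead would silently change the scaling and the category codes, which is exactly the kind of mismatch a shared data module is meant to prevent.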

Implementing PyTorch Tabular

In this section, we will try out the framework, using scikit-learn to supply the dataset and the evaluation metrics.

Install PyTorch Tabular with its full functionality using pip:

 ! pip install pytorch_tabular[all]

  1. Import the dependencies
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
 
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import random
import numpy as np
import pandas as pd
import os
  2. Below we write a function to evaluate the network and then load the data.
# Function to evaluate the network
def print_metrics(y_true, y_pred, tag):
    # Accept DataFrames, Series or arrays and flatten them to 1-D arrays
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    acc = accuracy_score(y_true, y_pred)
    report = classification_report(y_true, y_pred)
    print(f"{tag} Acc: {acc} | {tag} Classification Report:\n{report}")
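# Prepare the data in the form that the framework accepts
data = load_digits()
features = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.DataFrame(data.target, columns=['target'])
 
data = pd.concat([features, target], axis=1)
cat_col_names = list(data.select_dtypes('object').columns)
num_col_names = list(data.select_dtypes('float64').columns)
 
# Split into train, validation and test sets, which are used in the steps below
train, test = train_test_split(data, random_state=42)
train, val = train_test_split(train, random_state=42)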
  3. We have already discussed the configuration modules. Below we define those configuration settings and bind them together inside TabularModel.
data_config = DataConfig(
    target=['target'],     
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
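trainer_config = TrainerConfig(
    auto_lr_find=True,  # run the learning rate finder before training
    batch_size=1024,
    max_epochs=100,
    gpus=-1,  # use all available GPUs; this argument may differ across library versions
)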
optimizer_config = OptimizerConfig()
 
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  
    activation="LeakyReLU", 
    learning_rate = 1e-3
)
 
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
  4. Now that the configs and TabularModel have been defined, all we have to do is call the fit method and pass the training data frame. A validation data frame can also be passed in. If it is not provided, TabularModel will randomly hold out 20% of the training data for validation (this is also customizable).

tabular_model.fit(train=train, validation=val)

  5. Now let’s predict on the test dataset and look at the accuracy and the classification report, since this is a multi-class classification problem.
pred_df = tabular_model.predict(test)
print_metrics(test['target'], pred_df["prediction"], tag="Holdout")
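
The default behavior mentioned in step 4, holding out a random 20% of the training data for validation, can be pictured with the following standard-library sketch. The split_validation function is a hypothetical helper written for this article; it is not how the library implements the split, only an illustration of the idea.

```python
import random


def split_validation(rows, val_fraction=0.2, seed=42):
    """Randomly hold out `val_fraction` of the rows for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = int(len(rows) * val_fraction)
    val_idx = set(indices[:n_val])
    train_part = [r for i, r in enumerate(rows) if i not in val_idx]
    val_part = [r for i, r in enumerate(rows) if i in val_idx]
    return train_part, val_part


rows = list(range(100))  # stand-in for 100 training rows
train_part, val_part = split_validation(rows)
print(len(train_part), len(val_part))
```

Passing an explicit validation data frame, as we did above, keeps the split under our control and makes results easier to compare across runs.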

Final Words

In this article, we discussed PyTorch Tabular, which offers a unified and simple API for deep learning on tabular data, akin to what scikit-learn has done for traditional machine learning techniques. We went over what PyTorch Tabular is, how it is designed, and how to use it.

Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.