On tabular data, deep learning has traditionally lagged behind gradient boosting in both popularity and performance. However, newer models developed expressly for tabular data, such as XBNet, have recently pushed the performance bar. In this post, we will look into PyTorch Tabular, a framework built specifically for tabular data with the intention of making deep learning on such data easy and accessible for real-world use cases. We will discuss how this framework was created, what design principles it follows, and how it can be applied. The major points to be discussed in this article are listed below.
Table of Contents
- PyTorch Tabular
- Design of the Library
- Implementing PyTorch Tabular
Let’s start the discussion by looking at how the framework came to be.
PyTorch Tabular
PyTorch Tabular is a framework for deep learning with tabular data that aims to be simple and accessible for both real-world applications and academic research. The following are the design principles for the library:
- Low resistance and usability
- Customization is simple
- Scalable and easier to set up
PyTorch Tabular aims to make the software engineering around neural networks as straightforward and painless as possible, enabling you to focus on the model. It also aims to bring together the many recent breakthroughs in deep learning for tabular data into a single framework with a common API that works across a variety of state-of-the-art models. Finally, it comes with a base model that can be readily customized, helping deep learning researchers create new architectures for tabular data.
PyTorch Tabular stands on the shoulders of giants such as PyTorch, PyTorch Lightning, and pandas.

Design of the Library
PyTorch Tabular is intended to make the standard modeling pipeline simple enough for practitioners while also being reliable enough for production use. It also focuses on customization so that it can be used in a variety of research settings. The picture below depicts the structure of the framework.

Now let’s briefly discuss each of the framework’s modules, starting with the configuration modules.
Data Config
DataConfig is where we define how data is handled within the pipeline. This configuration differentiates between categorical and continuous features, determines whether normalization or feature transformations are applied, and so on. A minimal sketch is shown below.
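As an illustration, here is a minimal sketch of a DataConfig. The column names are hypothetical placeholders, and the transform and normalization arguments reflect options described in the library’s documentation.

# A minimal sketch of a DataConfig; column names are hypothetical
data_config = DataConfig(
    target=["target"],                               # name(s) of the target column(s)
    continuous_cols=["age", "income"],               # hypothetical continuous features
    categorical_cols=["city"],                       # hypothetical categorical feature
    continuous_feature_transform="quantile_normal",  # optional transformation
    normalize_continuous_features=True,              # standardize continuous features
)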
Model Config
For each model implemented in PyTorch Tabular, a new ModelConfig is defined. It derives from a base ModelConfig that contains common parameters such as the task (classification or regression), learning rate, loss, metrics, and so forth. Each model inherits these parameters and adds model-specific hyperparameters to the configuration. By selecting the matching ModelConfig, PyTorch Tabular automatically initializes the correct model.
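For example, the shared base parameters can be set on any model’s config, such as the CategoryEmbeddingModelConfig used later in this post. The loss and metric names below are assumptions based on the library’s convention of referencing standard PyTorch losses and torchmetrics functional metrics by name.

# A sketch of the shared ModelConfig parameters; the loss and metric
# names are assumptions, not taken from the original example
model_config = CategoryEmbeddingModelConfig(
    task="classification",    # or "regression"
    learning_rate=1e-3,
    loss="CrossEntropyLoss",  # standard PyTorch loss, referenced by name
    metrics=["accuracy"],     # tracked alongside the loss during training
)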
Trainer Config
TrainerConfig manages all of the parameters that control training, most of which are passed through to the underlying PyTorch Lightning trainer. Batch size, maximum epochs, early stopping, and other parameters can be set here.
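Early stopping is mentioned above but not used in the example later in this post; a sketch of such a configuration might look like the following. The parameter names follow the library’s documented options, but treat the specific values as illustrative assumptions.

# A sketch of a TrainerConfig with early stopping enabled; the
# monitored metric and patience are illustrative assumptions
trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=100,
    early_stopping="valid_loss",  # metric to monitor for early stopping
    early_stopping_patience=5,    # epochs to wait for improvement
)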
Optimizer Config
Optimizers and learning rate schedules are another important aspect of training a neural network, and these can be adjusted through the OptimizerConfig.
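As a sketch, you could swap in a different optimizer and attach a learning rate schedule as shown below. The string-based names follow the library’s pattern of referencing classes in torch.optim and torch.optim.lr_scheduler by name; the specific choices here are assumptions.

# A sketch of a customized OptimizerConfig; the optimizer, scheduler,
# and their parameters are illustrative assumptions
optimizer_config = OptimizerConfig(
    optimizer="AdamW",                 # class name from torch.optim
    optimizer_params={"weight_decay": 1e-4},
    lr_scheduler="ReduceLROnPlateau",  # class name from torch.optim.lr_scheduler
    lr_scheduler_params={"patience": 3},
)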
Experimental Config
Experiment tracking is practically a requirement in machine learning and is crucial for reproducibility. PyTorch Tabular recognizes this and provides experiment tracking internally. It currently supports two experiment tracking frameworks: TensorBoard and Weights & Biases.
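Although ExperimentConfig is imported in the example later in this post, it is never used there. A minimal sketch of wiring up TensorBoard logging might look like the following; the project and run names are hypothetical.

# A sketch of experiment tracking with TensorBoard; the project and
# run names are hypothetical
experiment_config = ExperimentConfig(
    project_name="pytorch_tabular_demo",  # hypothetical project name
    run_name="digits_baseline",           # hypothetical run name
    log_target="tensorboard",             # or "wandb"
)

The resulting config would then be passed to TabularModel alongside the other configs.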
Base Model
PyTorch Tabular makes use of the abstract BaseModel class, which implements the standard parts of any model definition, such as loss and metric calculation. This class acts as a foundation for any other model and guarantees that the model and the training engine work together seamlessly. A new model that inherits this class only has to implement two methods: one that builds the network and one that defines the forward pass, as sketched below.
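To make that concrete, here is a skeletal custom model. The method names (_build_network and forward) and the dict-style input and output follow the pattern described in the library’s documentation for custom models, but treat the specifics as assumptions rather than a definitive implementation.

import torch.nn as nn
from pytorch_tabular.models import BaseModel

# A skeletal custom model; assumes the documented pattern of overriding
# _build_network and forward, with forward returning a dict of outputs
class MyTabularModel(BaseModel):
    def _build_network(self):
        # Define the layers; the input and output sizes are filled in
        # from the configs by the framework
        self.linear = nn.Linear(self.hparams.continuous_dim, self.hparams.output_dim)

    def forward(self, x):
        # x is a dict with "continuous" and "categorical" tensors
        logits = self.linear(x["continuous"])
        return {"logits": logits}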
Data Module
PyTorch Tabular uses the Data Module, as specified by PyTorch Lightning, to unify and standardize data processing. It handles preprocessing, label encoding, category encoding, feature transformations, target transformations, and other data processing, and it ensures that the same transformations are applied to the training and validation splits as well as to fresh, unseen data. It also provides the PyTorch data loaders used for training and inference.
Implementing PyTorch Tabular
In this section, we will implement the framework with the support of scikit-learn for loading the dataset and computing the evaluation metrics.
Install PyTorch Tabular along with all of its optional dependencies using pip:
! pip install pytorch_tabular[all]
- Import the dependencies
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import random
import numpy as np
import pandas as pd
import os
- Below we will write a function to evaluate the network, then load the data and split it into train, validation, and test sets.
# Function to evaluate the network
def print_metrics(y_true, y_pred, tag):
    if isinstance(y_true, pd.DataFrame) or isinstance(y_true, pd.Series):
        y_true = y_true.values
    if isinstance(y_pred, pd.DataFrame) or isinstance(y_pred, pd.Series):
        y_pred = y_pred.values
    if y_true.ndim > 1:
        y_true = y_true.ravel()
    if y_pred.ndim > 1:
        y_pred = y_pred.ravel()
    val_acc = accuracy_score(y_true, y_pred)
    val_f1 = classification_report(y_true, y_pred)
    print(f"{tag} Acc: {val_acc} | {tag} Classification Report:\n{val_f1}")
# Prepare the data in the form that the framework accepts
data = load_digits()
file1 = pd.DataFrame(data.data, columns=data.feature_names)
file2 = pd.DataFrame(data.target, columns=['target'])
data = pd.concat([file1, file2], axis=1)
cat_col_names = list(data.select_dtypes('object').columns)
num_col_names = list(data.select_dtypes('float64').columns)

# Split into train, validation, and holdout test sets, which are
# used by fit() and predict() below
train, test = train_test_split(data, random_state=42)
train, val = train_test_split(train, random_state=42)
- We have discussed the five configuration modules; below we define those configuration settings and bind them together inside a TabularModel.
data_config = DataConfig(
    target=['target'],
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # run the learning rate finder before training
    batch_size=1024,
    max_epochs=100,
    gpus=-1,  # -1 uses all available GPUs; None runs on CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-512-512",  # number of nodes in each hidden layer
    activation="LeakyReLU",
    learning_rate=1e-3,
)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
- Now that the configs and the TabularModel have been defined, all we have to do is call the fit method and pass in the train data frame. A validation data frame can also be passed in; if it is not supplied, TabularModel will randomly sample 20% of the training data for validation (the fraction is also customizable).
tabular_model.fit(train=train, validation=val)
- Now let’s predict on the held-out test set and look at the accuracy and the classification report, since this is a multi-class classification problem.
pred_df = tabular_model.predict(test)
print_metrics(test['target'], pred_df["prediction"], tag="Holdout")

Final Words
In this article, we discussed PyTorch Tabular, which provides a unified and simple API for deep learning on tabular data, much as scikit-learn has done for traditional machine learning techniques. We went over what PyTorch Tabular is, how it is designed, and how to use it.