A combined team from Facebook AI Research and Georgia Institute of Technology has come up with a new approach, known as Tensor Train decomposition for DLRMs (TT-Rec), to compress the size of deep learning recommendation models by up to 112 times.
Deep neural networks (DNNs) are applied across domains such as predictive forecasting, medical diagnosis, autonomous driving and natural language processing. Meanwhile, the capacity of the embedding tables in deep learning recommendation models (DLRMs) is increasing dramatically as the efficiency of the models improves.
Why This Research
Over the years, DNNs have grown along several dimensions, including the size of data, model complexity, and the cost of infrastructure for training and deploying models. For instance, OpenAI’s GPT-3 comprises 175 billion parameters, and Facebook saw an eight-fold increase in the amount of computation required for machine learning model training in one year (2019-2020). Such unprecedented growth results in costly and complex models. In response, the researchers have developed an algorithmic approach to deal with the large memory requirement of DNNs.
Tech Behind The Model
Deep learning-based recommendation models (DLRMs) are among the most resource-demanding deep learning workloads. According to the researchers, the large embedding tables in recommendation models account for 99 percent of the total recommendation model capacity.
To that end, the researchers used a method known as tensorization to tackle the large memory capacity demand of embedding tables in a DLRM. At a high level, tensorization works by replacing a neural network’s layers with an approximate, structured low-rank form. This form is parametric: its shape determines the design trade-off between storage capacity, execution time, and model accuracy.
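To make the storage-versus-accuracy trade-off concrete, here is a minimal sketch of the simplest low-rank form, a truncated SVD of a single weight matrix. It is not the tensor-train format TT-Rec uses, and the matrix size, the ranks and the function names are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (rows, rank), singular values folded in
    b = vt[:rank, :]             # shape (rank, cols)
    return a, b

def storage_and_error(weight, rank):
    """Parameter count of the two factors and the relative approximation error."""
    a, b = low_rank_factors(weight, rank)
    params = a.size + b.size
    rel_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)
    return params, rel_error

# Hypothetical dense layer: 256 x 128 random weights.
rng = np.random.default_rng(0)
weight = rng.standard_normal((256, 128))

for rank in (4, 16, 64):
    params, err = storage_and_error(weight, rank)
    print(f"rank={rank:3d}  params={params:6d}  dense={weight.size}  rel_error={err:.3f}")
```

A smaller rank stores fewer parameters but approximates the original weights less closely, which is the kind of shape-driven trade-off the paragraph above describes. Random weights are a pessimistic case; trained weights may compress better or worse.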
The researchers have designed the Tensor-Train compression technique for deep learning Recommendation models, known as TT-Rec. TT-Rec is based on the idea of replacing large embedding tables in a DLRM with a sequence of matrix products. “TT-Rec uses a hybrid approach to learn features and deliver on-par model accuracy while requiring orders-of-magnitude less memory capacity,” the researchers said.
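The "sequence of matrix products" idea can be sketched as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: the table shape (960 rows by 32 columns), its factorisation into three cores, the rank of 8 and all function names are hypothetical choices made for the example.

```python
import numpy as np

ROW_FACTORS = (8, 10, 12)   # 8 * 10 * 12 = 960 embedding rows (assumed)
DIM_FACTORS = (2, 4, 4)     # 2 * 4 * 4 = 32 embedding dimensions (assumed)
RANK = 8                    # TT rank shared by both internal bonds (assumed)

def make_tt_cores(row_factors, dim_factors, rank, rng):
    """Create cores of shape (r_in, n_k, d_k, r_out) with boundary ranks of 1."""
    ranks = [1] + [rank] * (len(row_factors) - 1) + [1]
    return [
        0.1 * rng.standard_normal((ranks[k], n, d, ranks[k + 1]))
        for k, (n, d) in enumerate(zip(row_factors, dim_factors))
    ]

def tt_lookup(cores, row_factors, index):
    """Rebuild one embedding row from the cores without forming the full table."""
    digits = np.unravel_index(index, row_factors)   # row index -> one digit per core
    vec = np.ones((1, 1))
    for core, digit in zip(cores, digits):
        r_in, _, d, r_out = core.shape
        core_slice = core[:, digit, :, :].reshape(r_in, d * r_out)
        vec = (vec @ core_slice).reshape(-1, r_out)  # one small matrix product per core
    return vec.reshape(-1)

def tt_to_dense(cores):
    """Materialise the full table (only sensible for small examples)."""
    g1, g2, g3 = cores
    full = np.einsum("xiap,pjbq,qkcy->ijkabc", g1, g2, g3)
    rows = g1.shape[1] * g2.shape[1] * g3.shape[1]
    return full.reshape(rows, -1)

rng = np.random.default_rng(0)
cores = make_tt_cores(ROW_FACTORS, DIM_FACTORS, RANK, rng)
dense = tt_to_dense(cores)

row = tt_lookup(cores, ROW_FACTORS, 123)
print("row matches dense table:", np.allclose(row, dense[123]))
print("dense params:", dense.size, "TT params:", sum(c.size for c in cores))
```

Only the cores are stored, so the parameter count depends on the factor sizes and the rank rather than on the product of rows and columns. The price is extra arithmetic on every lookup.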
The above figure depicts the generalised model architecture for DLRMs. The model has two primary components: Multi-Layer Perceptron (MLP) layer modules and Embedding Tables (EMBs). The MLP layers process continuous features, such as user age, while the EMBs process categorical features by encoding sparse, high-dimensional inputs into a dense vector representation. TT-Rec customises the TT-decomposition method to compress embedding tables in deep learning recommendation models.
- The research applied tensor-train compression in a new application context, compressing the embedding layers of deep learning recommendation models (DLRMs).
- The researchers quantified the potential trade-off between memory requirements and accuracy.
- To recover the accuracy loss, the researchers proposed a sampled Gaussian distribution for the weight initialisation of the tensor cores. To accelerate TT-Rec’s training performance, they introduced a separate cache structure that stores frequently accessed embedding vectors in uncompressed format, which empirically also helps improve accuracy.
- TT-Rec achieved higher model accuracy at the cost of a 10 percent increase in training time on average. The approach also reduces the total memory requirement of the embedding tables by up to 112 times.
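The cache idea from the list above can be sketched roughly as follows. This is a simplified illustration, not the structure described in the paper: the class name, the frequency-based refresh policy and the stand-in `fake_compressed_lookup` function are all assumptions, and the sketch only covers lookups, not how cached rows would be trained.

```python
from collections import Counter
import numpy as np

class CachedEmbedding:
    """Serve hot rows from an uncompressed cache and the rest from a compressed lookup."""

    def __init__(self, compressed_lookup, cache_size):
        self.compressed_lookup = compressed_lookup  # callable: row index -> vector
        self.cache_size = cache_size
        self.access_counts = Counter()
        self.cache = {}      # row index -> uncompressed vector
        self.hits = 0
        self.misses = 0

    def lookup(self, index):
        self.access_counts[index] += 1
        if index in self.cache:
            self.hits += 1
            return self.cache[index]
        self.misses += 1
        return self.compressed_lookup(index)

    def refresh(self):
        """Rebuild the cache from the most frequently accessed rows seen so far."""
        hot = [idx for idx, _ in self.access_counts.most_common(self.cache_size)]
        new_cache = {}
        for idx in hot:
            if idx in self.cache:
                new_cache[idx] = self.cache[idx]               # keep existing entry
            else:
                new_cache[idx] = self.compressed_lookup(idx)   # decompress once
        self.cache = new_cache

def fake_compressed_lookup(index):
    # Hypothetical stand-in for a tensor-train lookup.
    return np.full(4, float(index))

emb = CachedEmbedding(fake_compressed_lookup, cache_size=2)
for index in [3, 3, 3, 3, 3, 7, 7, 7, 9]:
    emb.lookup(index)        # warm-up accesses, all served from the compressed path
emb.refresh()
emb.lookup(3)                # now served from the cache
emb.lookup(9)                # still served from the compressed path
print(sorted(emb.cache), emb.hits, emb.misses)
```

Because recommendation workloads tend to access a small set of rows far more often than the rest, even a small cache of this kind can skip the decompression arithmetic for many lookups.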
Benefits of TT-Rec
- TT-Rec provides a flexible design space between memory capacity, training time and model accuracy.
- TT-Rec is a highly effective approach, especially for online recommendation training.
- According to the researchers, the orders-of-magnitude lower memory requirement with TT-Rec also unlocks many modern AI training accelerators for DLRM training.
- TT-Rec suits accelerators like GPUs with a relatively higher compute-to-memory (FLOPs-per-Byte) ratio and limited memory capacity.
The research demonstrated significant compression ratios and improved training time performance for DLRMs, achieved through a judicious design and parameterisation of the tensor-train compression technique.
Read the paper here.