
How Can Memory Augmentation Work Wonders For Large Scale NLP Tasks


Current machine learning models deployed for vision and natural language processing (NLP) tasks have more than a billion parameters. This allows for better results, as the model generalizes better with a larger number of parameters. But there is a catch: as the capacity increases, so does the computational complexity.

The ability to increase the number of parameters while keeping the same computational budget allows the overall system to strike a better trade-off between prediction accuracy and computational efficiency, at both training and test time.

To address these challenges, a paper was introduced that proposes a structured memory which can be easily integrated into a neural network. The memory is very large by design and therefore significantly increases the capacity of the architecture, by up to a billion parameters, with negligible computational overhead.

This new layer is purported to tackle problems in areas where existing architectures either underfit in the presence of vast amounts of available data or are too slow to work in practice.

Overview Of Key-Value Memory Layer 

In a key-value memory layer, the input ‘x’ is processed through a query network that produces a query vector ‘q’, which is compared to all the keys. The output is the sparse weighted sum over the memories associated with the selected keys.
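
To make this concrete, here is a minimal, illustrative PyTorch sketch of such a layer (not the authors' implementation). The names `query_net`, `n_keys` and `topk`, and the simple linear query network, are assumptions made for the example; it scores the query against every key, which is the naive approach that the product-key trick described below avoids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlatKeyValueMemory(nn.Module):
    """Illustrative flat key-value memory: query -> top-k keys -> weighted sum of values."""
    def __init__(self, d_model, n_keys, topk=4):
        super().__init__()
        self.query_net = nn.Linear(d_model, d_model)             # maps input x to a query q
        self.keys = nn.Parameter(torch.randn(n_keys, d_model))   # one key per memory slot
        self.values = nn.Embedding(n_keys, d_model)               # one value vector per memory slot
        self.topk = topk

    def forward(self, x):                                # x: (batch, d_model)
        q = self.query_net(x)                            # (batch, d_model)
        scores = q @ self.keys.t()                       # (batch, n_keys): compare q to all keys
        best, idx = scores.topk(self.topk, dim=-1)       # keep only the k closest keys
        weights = F.softmax(best, dim=-1)                # sparse weights over the selected slots
        vals = self.values(idx)                          # (batch, topk, d_model)
        return (weights.unsqueeze(-1) * vals).sum(dim=1) # sparse weighted sum of memories
```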

The authors define the keys as the concatenation of two sub-keys. The resulting keys, which are very large in number, can be thought of as memory slots. Despite this, finding the exact closest keys to the input is very efficient, typically requiring O(√|K|) vector comparisons, where |K| is the total number of memory slots.
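
The rough sketch below (a hedged illustration, not the paper's code) shows why the sub-key factorization makes the search cheap: the query is split into two halves, each half is scored only against its own small set of sub-keys, and the best full keys are recovered by combining the two candidate lists, giving on the order of √|K| comparisons instead of |K|.

```python
import torch

def product_key_topk(q, sub_keys1, sub_keys2, topk=4):
    """q: (d,); sub_keys1/sub_keys2: (n_sub, d/2). The full key set has n_sub**2 slots."""
    d_half = q.shape[0] // 2
    q1, q2 = q[:d_half], q[d_half:]

    s1 = sub_keys1 @ q1                    # (n_sub,) scores for the first half of the query
    s2 = sub_keys2 @ q2                    # (n_sub,) scores for the second half
    v1, i1 = s1.topk(topk)                 # best sub-keys on each half:
    v2, i2 = s2.topk(topk)                 # 2 * n_sub comparisons instead of n_sub**2

    # Combine the topk x topk candidates; a full key's score is the sum of its half-scores.
    cand = (v1[:, None] + v2[None, :]).flatten()
    best, flat = cand.topk(topk)
    rows, cols = flat // topk, flat % topk
    slot_ids = i1[rows] * sub_keys2.shape[0] + i2[cols]   # index into the n_sub**2 slots
    return best, slot_ids
```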

Only a handful of memory slots are updated for each input at training time; this sparsity of key selection and parameter updates makes both training and inference very efficient.
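
As a toy illustration of this sparsity (an assumption-laden example, not the authors' training loop), PyTorch's sparse embeddings combined with `SparseAdam` update only the value rows that were actually selected for the current input:

```python
import torch
import torch.nn as nn

# With sparse=True, an optimizer step touches only the memory slots selected for this batch.
values = nn.Embedding(num_embeddings=1_000_000, embedding_dim=512, sparse=True)
optimizer = torch.optim.SparseAdam(values.parameters(), lr=1e-3)

idx = torch.tensor([[3, 17, 42, 99]])    # the handful of slots picked for one input
loss = values(idx).mean()                # toy loss over the selected memories
loss.backward()                          # gradient is a sparse tensor over 4 rows
optimizer.step()                         # only those 4 memory slots are updated
```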

To validate their claims, the authors experimented with the widely popular BERT (Bidirectional Encoder Representations from Transformers) and GPT-2 (Generative Pre-trained Transformer). The attempt here is to integrate the memory within these transformer architectures.

Augmenting Large-Scale Language Models

BERT and GPT-2 were selected because of their success in proving that increasing the capacity of large models directly translates to large improvements in language modelling, which in turn translates to better performance in both language understanding tasks and text generation.

The transformer network is the current workhorse of Natural Language Processing (NLP) and is built by stacking blocks composed of self-attention layers followed by fully connected layers (dubbed FFN). 

The components of the memory layer bear similarities to the query, key and value networks used in these self-attention layers, with two notable differences (a sketch of the combined block follows this list): 

  • the keys and values do not correspond to input tokens but are free embedding vectors, and 
  • the number of values (memory size) is very large.
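
Putting the pieces together, a hedged sketch of how such a memory layer could sit in a transformer block, in the position usually occupied by the FFN, might look as follows. `FlatKeyValueMemory` refers to the earlier sketch, and the exact placement and normalization choices here are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class TransformerBlockWithMemory(nn.Module):
    """Illustrative transformer block with the FFN sub-layer swapped for a key-value memory."""
    def __init__(self, d_model, n_heads, n_keys):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # The memory layer takes the place usually occupied by the fully connected FFN.
        self.memory = FlatKeyValueMemory(d_model, n_keys)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                              # residual + norm around attention
        m = self.memory(x.reshape(-1, x.shape[-1]))        # query the memory once per token
        m = m.reshape(x.shape)
        return self.norm2(x + m)                           # residual + norm around the memory
```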

This work borrows ideas from product quantization (PQ), an approximate search technique that maps database vectors into compact codes. It also exploits the idea of representing a large set of key vectors with a drastically smaller number of vectors that are updated by regular back-propagation.
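
A quick back-of-the-envelope calculation (with an illustrative codebook size, not necessarily the one used in the paper) shows how far this compresses the key storage:

```python
# Two sub-key codebooks of size 1024 induce 1024**2 addressable memory slots,
# while only 2 * 1024 half-dimension sub-key vectors need to be stored and trained.
n_sub = 1024
print(f"addressable slots: {n_sub ** 2:,}")   # 1,048,576
print(f"stored sub-keys:   {2 * n_sub:,}")    # 2,048
```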

The training set used for the experiment was composed of 28 billion words (140 GB of data) extracted from about 40 million English news articles indexed by Common Crawl corpora.  The validation and test sets are both composed of 5000 news articles removed from the training set.

The authors found it beneficial to set the Adam learning rate to 10⁻³. The models were implemented in PyTorch and trained on 32 Volta GPUs.

Key Takeaways

This work is an attempt to:

  • Propose a new key-value memory layer that drastically increases the capacity of a neural network with negligible computational overhead.
  • Provide results that show significant gains on large-scale language modelling, with a 12-layer model matching the performance of a 24-layer BERT-large model at half the running time.
  • Demonstrate why adding memory to the model is more beneficial than increasing the number of layers.

Know more about the work here.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.