A Beginner’s Guide To Attention And Memory In Deep Learning

Have you ever wondered how you make sense of what your friend is saying at a loud party? A party is full of competing noises, yet we are perfectly able to carry on a conversation. This question is widely known as the ‘cocktail party problem’. Most of our cognitive processes can pay attention to only a single activity at a time, and how we direct attention towards one stream of words while ignoring others, often louder ones, is still a conundrum.

The key cognitive processes behind solving the cocktail party problem are attention and short-term memory. Attention and memory are a recurring theme in most of our daily activities. Being able to make sense of what we see or hear by attending to small chunks of information, rather than the whole stream, lets us react in real time without consuming too much memory. Processing and storage are just as much of a challenge for computational systems.

For machines, everything is 0s and 1s, and yet the concept of attention as we know it can be built into them with well-crafted algorithms.

Attention and memory have emerged as two vital new components of deep learning over the last few years. 

The ability to focus on one thing and ignore others has a vital role in guiding cognition. Not only does this allow us to pick out salient information from noisy data (the cocktail party problem), it also allows us to pursue one thought at a time and to remember one event rather than all of them.

In neural networks, attention can be thought of as a vector of importance weights. These weights tell the network where to look: which pixel in an image or which word in a sentence matters for the current prediction. To build them, the network estimates how strongly each element is related to the others and uses the resulting attention vector to approximate the target.
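To make this concrete, here is a minimal sketch of dot-product attention weights in plain NumPy. The embeddings, the query and the helper name `attention` are made up for illustration, not taken from any particular library.

```python
# Minimal sketch of dot-product attention weights (illustrative values).
import numpy as np

def attention(query, keys, values):
    """Weight `values` by how strongly `query` relates to each key."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # relatedness scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # attention vector: importance weights summing to 1
    return weights, weights @ values                  # weights and the attended summary

# Three hypothetical 4-dimensional token embeddings and a query vector.
keys = values = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])

weights, context = attention(query, keys, values)
print(weights)  # largest weights fall on the elements most related to the query
```

The same idea scales up to full sentences or images: the softmax turns raw relatedness scores into weights that say where to look.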

Attention And Memory In Machine Learning

Source: Alex Graves, DeepMind

Deep nets naturally learn a form of implicit attention where they respond more strongly to some parts of the data than others. 

RNNs contain a recursive hidden state and learn functions from sequences of inputs (e.g. a speech signal) to sequences of outputs (e.g. words). One way to see which parts of the input such a network implicitly attends to is to compute its sequential Jacobian.

The sequential Jacobian, a set of derivatives of the outputs at one time step with respect to the inputs at every other time step, shows which past inputs the network remembers when predicting current outputs. It is represented as follows:

J^{t}_{t'} = \frac{\partial y^{t}}{\partial x^{t'}}

where x and y denote the input and output vectors respectively, and t, t' index the time steps.
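As a rough illustration, the sketch below uses PyTorch autograd to compute one (summed over output components) row of such a sequential Jacobian for a small, untrained RNN. The network, its sizes and the random input sequence are all illustrative assumptions.

```python
# Sketch: sensitivity of the final output to each past input (a sequential Jacobian row).
import torch
import torch.nn as nn

T, input_size, hidden_size = 5, 3, 8
rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # toy recurrent network
readout = nn.Linear(hidden_size, 2)

x = torch.randn(1, T, input_size, requires_grad=True)    # input sequence x^1 ... x^T
h, _ = rnn(x)
y = readout(h)                                           # output sequence y^1 ... y^T

# Gradient of the (summed) final output y^T with respect to every input x^{t'}.
grads = torch.autograd.grad(y[0, -1].sum(), x)[0]
sensitivity = grads[0].norm(dim=-1)                      # one magnitude per past time step
print(sensitivity)  # larger values = inputs the network "remembers" at the final step
```

Plotting these magnitudes over time is how the implicit attention of a trained RNN is usually visualised.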

The network produces an extra output vector used to parametrise an attention model. The attention model then operates on some data (image, audio sample, text to be translated…) to create a fixed-size “glimpse” vector that is passed to the network as input at the next time step. The complete system is recurrent, even if the network isn’t. 
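The sketch below mimics that loop in PyTorch, assuming a toy image, a GRU core, and a hard 5×5 crop standing in for the attention model. The component names (`core`, `glimpse_net`, `attn_head`) and all sizes are illustrative, not the exact architecture from the lecture.

```python
# Sketch of a recurrent attention loop: the network emits attention parameters,
# the attention model returns a fixed-size glimpse, which is fed back next step.
import torch
import torch.nn as nn
import torch.nn.functional as F

image = torch.randn(1, 1, 28, 28)                    # the data the attention model operates on
core = nn.GRUCell(input_size=64, hidden_size=128)    # the (recurrent) network
glimpse_net = nn.Linear(5 * 5, 64)                   # turns a crop into a fixed-size glimpse vector
attn_head = nn.Linear(128, 2)                        # extra output vector: (x, y) location of the glimpse

h = torch.zeros(1, 128)
for step in range(4):
    loc = torch.tanh(attn_head(h))                      # network output parametrising the attention model
    cx = int((loc[0, 0].item() + 1) / 2 * 23)           # map [-1, 1] to a valid 5x5 crop corner
    cy = int((loc[0, 1].item() + 1) / 2 * 23)
    crop = image[:, :, cy:cy + 5, cx:cx + 5]            # the attention model reads part of the data
    glimpse = F.relu(glimpse_net(crop.reshape(1, -1)))  # fixed-size "glimpse" vector
    h = core(glimpse, h)                                # passed back to the network at the next step
```

Even if `core` were a plain feed-forward cell, the glimpse feedback would make the complete system recurrent, which is the point made above.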

We can elaborate on this further with the example of the single-headed attention RNN (SHA-RNN) by Stephen Merity.

The model consists of a trainable embedding layer, one or more stacked SHA-RNN layers, and a softmax classifier. Each layer uses a single head of attention and a modified feedforward layer referred to as a Boom layer, which takes a vector from small (1024) to big (4096) and back to small (1024).

The Boom layer is strongly related to the large feed-forward layer found in Transformers and other architectures.

This layer minimizes computation and removes an entire matrix of parameters compared to traditional down-projection layers.
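One way to read that description, sketched below in PyTorch, is an up-projection followed by a chunked sum instead of a second, down-projection matrix. The class name `Boom` and the GELU activation here are assumptions based on the description above, not a verbatim copy of Merity's code.

```python
# Sketch of a Boom-style feed-forward layer: 1024 -> 4096 -> 1024
# without a down-projection matrix.
import torch
import torch.nn as nn

class Boom(nn.Module):
    def __init__(self, d_model=1024, d_inner=4096):
        super().__init__()
        assert d_inner % d_model == 0
        self.up = nn.Linear(d_model, d_inner)  # the only weight matrix in the layer
        self.act = nn.GELU()

    def forward(self, x):
        big = self.act(self.up(x))                           # small (1024) -> big (4096)
        chunks = big.view(*big.shape[:-1], -1, x.shape[-1])  # split into 4 chunks of 1024
        return chunks.sum(dim=-2)                            # big (4096) -> small (1024) by summing chunks

y = Boom()(torch.randn(2, 1024))
print(y.shape)  # torch.Size([2, 1024])
```

Summing the chunks back down is what removes the entire down-projection matrix a traditional feed-forward block would need.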

Attention models generally work by defining a probability distribution over glimpses of the data given some set of attention outputs from the network.
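As a small illustration, the snippet below turns a handful of hypothetical attention scores into a probability distribution over candidate glimpses; the distribution can either be sampled from (hard attention) or used directly as soft weights.

```python
# Sketch: attention outputs -> probability distribution over glimpses.
import torch

attention_outputs = torch.tensor([2.0, 0.5, -1.0, 0.1])  # one hypothetical score per candidate glimpse
p_glimpse = torch.softmax(attention_outputs, dim=0)      # probability distribution over the glimpses

hard_choice = torch.distributions.Categorical(probs=p_glimpse).sample()  # pick a single glimpse
print(p_glimpse, hard_choice)
```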

There are different types of attention:

  • Self-Attention
  • Selective Attention
  • Introspective Attention
  • Differentiable Visual Attention
  • Associative Attention

Applications of attention and memory modules in neural networks:

  • Transformers, the models that have revolutionised NLP, are built on attention mechanisms. When released in 2017, the Transformer set new benchmarks for machine translation, and this innovation led to models such as BERT, GPT and many more.
  • Handwriting synthesis with RNNs
  • Differentiable Neural Computers

Things To Remember:

  • Selective attention appears to be as useful for deep learning as it is for people
  • We can use attention to attend to memory as well as directly to data
  • Many types of attention mechanism (content, spatial, visual, temporal…) can be defined
  • Attention mechanisms revolutionised language modeling (e.g. Transformers)
