A Beginner’s Guide To Attention And Memory In Deep Learning

You may never have stopped to wonder how you manage to follow what a friend is saying at a loud party. A party is full of competing noise, so how are we able to carry on a conversation at all? This question is widely known as the ‘cocktail party problem’. Most of our cognitive processes can attend to only one activity at a time. How we direct attention towards one stream of words at a party while ignoring other streams, which are often louder, is still not fully understood.

The key cognitive processes behind solving the cocktail party problem are attention and short-term memory. Attention and memory are a recurring theme in most of our daily activities. Being able to work out what one sees or hears by attending to chunks of information, rather than the whole stream, helps one react in real time. In short, it keeps the memory load small. Processing and storage are major constraints for computational systems as well.

For machines, everything is 0s and 1s, and yet attention as we know it can be built into them with well-crafted algorithms.

Attention and memory have emerged as two vital new components of deep learning over the last few years. 

The ability to focus on one thing and ignore others has a vital role in guiding cognition. Not only does this allow us to pick out salient information from noisy data (the cocktail party problem), it also allows us to pursue one thought at a time and to remember one event rather than all events.

In neural networks, attention can be thought of as a vector of importance weights. The weights tell the network where to look: which pixel in an image or which word in a sentence deserves focus. To predict one element, the attention vector estimates how strongly it is related to the other elements, and the weighted sum of those elements is used to approximate the target.
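To make the idea of a weight vector concrete, here is a minimal sketch using only the Python standard library. It assumes dot-product scoring followed by a softmax, which is one common choice rather than the only one; the function names `softmax` and `attend` and the toy vectors are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, elements):
    """Score each element against the query, then blend the elements by weight."""
    scores = [sum(q * e for q, e in zip(query, elem)) for elem in elements]
    weights = softmax(scores)
    dim = len(elements[0])
    context = [sum(w * elem[d] for w, elem in zip(weights, elements))
               for d in range(dim)]
    return weights, context

# Three toy "word" vectors; the query is most similar to the second one.
elements = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]
weights, context = attend(query, elements)
print(weights)   # the second weight is the largest
print(context)   # weighted sum of the three vectors
```

The weights play the role of the attention vector: they say how much each element contributes to the blended `context` that stands in for the target.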

Attention And Memory In Machine Learning

via Alex Graves, DeepMind

Deep nets naturally learn a form of implicit attention where they respond more strongly to some parts of the data than others. 

RNNs contain a recurrent hidden state and learn functions from sequences of inputs (e.g. a speech signal) to sequences of outputs (e.g. words). One way to inspect the implicit attention of such a network is to compute its sequential Jacobian.

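The sequential Jacobian is a set of derivatives that shows which past inputs the network remembers when predicting the current output. It can be written as:

$$J^{t,t'}_{ki} = \frac{\partial y^{t}_{k}}{\partial x^{t'}_{i}}$$

where x and y denote the input and output vectors respectively, t is the time step of the output, and t' is the time step of the input.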
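The sequential Jacobian can be approximated numerically. The sketch below assumes a toy scalar RNN with hand-picked weights (`w_x`, `w_h`, `w_y` are illustrative, not from any trained model) and uses central finite differences to estimate how much the output at step `t` changes when each input step is nudged.

```python
import math

def run_rnn(xs, w_x=0.8, w_h=0.5, w_y=1.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), y_t = w_y * h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        ys.append(w_y * h)
    return ys

def sequential_jacobian(xs, t, eps=1e-5):
    """Approximate d y[t] / d x[t'] for every input step t'."""
    row = []
    for t_prime in range(len(xs)):
        up = list(xs)
        up[t_prime] += eps
        down = list(xs)
        down[t_prime] -= eps
        row.append((run_rnn(up)[t] - run_rnn(down)[t]) / (2 * eps))
    return row

xs = [0.3, -0.1, 0.7, 0.2, -0.4]
jac = sequential_jacobian(xs, t=3)
print(jac)  # sensitivity of y[3] to x[0], x[1], ..., x[4]
```

In this toy setting the sensitivity shrinks for inputs further in the past, and the entry for the future input `x[4]` is zero, which is the kind of "what does the network remember" picture the sequential Jacobian is meant to give.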

The network produces an extra output vector used to parametrise an attention model. The attention model then operates on some data (image, audio sample, text to be translated…) to create a fixed-size “glimpse” vector that is passed to the network as input at the next time step. The complete system is recurrent, even if the network isn’t. 
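The loop described above can be sketched as follows. This is a toy under stated assumptions: `controller` is a hand-written rule standing in for a trained network, and the attention model is a plain fixed-size crop of a 1-D signal. The controller itself keeps no state; the only thing carried from one step to the next is where the next glimpse is taken.

```python
def controller(glimpse):
    """Hypothetical stateless 'network': a hand-written rule, not a trained model.
    Returns a prediction and an attention output (how far to shift the window)."""
    prediction = max(glimpse)
    if glimpse[-1] > glimpse[0]:
        shift = 1       # signal rises to the right, so look further right
    elif glimpse[-1] < glimpse[0]:
        shift = -1      # signal rises to the left, so look further left
    else:
        shift = 0       # window is centred on a peak, stay put
    return prediction, shift

def take_glimpse(data, location, size):
    """Attention model: crop a fixed-size window from the data."""
    return data[location:location + size]

def run(data, size=3, steps=6):
    """Alternate between glimpsing and deciding where to glimpse next."""
    location, visited = 0, []
    for _ in range(steps):
        glimpse = take_glimpse(data, location, size)
        _, shift = controller(glimpse)
        visited.append(location)
        # Keep the window inside the data.
        location = min(max(location + shift, 0), len(data) - size)
    return visited

signal = [0, 1, 2, 3, 5, 8, 5, 3, 2, 1, 0]
print(run(signal))  # [0, 1, 2, 3, 4, 4]
```

The window walks towards the peak of the signal and then stays there, even though `controller` has no memory of its own: the recurrence lives entirely in the glimpse that is fed back.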

We can elaborate on this with the example of the Single Headed Attention RNN (SHA-RNN) model by Stephen Merity.

The model consists of a trainable embedding layer, one or more layers of a stacked single-headed attention recurrent neural network (SHA-RNN), and a softmax classifier. The model uses a single head of attention and a modified feed-forward layer, similar to the one in a Transformer, which is referred to as a Boom layer. The Boom layer takes a vector from small (1024) to big (4096) and back to small (1024).

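The Boom layer is closely related to the large feed-forward layer found in Transformers and other architectures.

Instead of using a second weight matrix to project back down, it splits the 4096-dimensional vector into four 1024-dimensional chunks and sums them. This reduces computation and removes an entire matrix of parameters compared to traditional down-projection layers.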
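A rough sketch of the Boom idea, based on a reading of the SHA-RNN paper rather than its released code: project up with one weight matrix and a GELU activation, then split the result into chunks and sum them instead of projecting down with a second matrix. The tiny sizes, random weights, and the function name `boom` are assumptions for illustration, and details of the real layer such as normalisation and dropout are left out.

```python
import math
import random

def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def boom(v, weights, bias, n_chunks):
    """Up-project v from H to N*H, then sum N chunks of size H (no down matrix)."""
    hidden = len(v)
    # Up-projection: one matrix-vector product followed by GELU.
    u = [gelu(sum(w * x for w, x in zip(row, v)) + b)
         for row, b in zip(weights, bias)]
    # Parameter-free "down-projection": add the N chunks element-wise.
    return [sum(u[c * hidden + i] for c in range(n_chunks))
            for i in range(hidden)]

rng = random.Random(0)
H, N = 4, 4  # toy sizes; the article's figures correspond to H=1024, N=4
weights = [[rng.gauss(0.0, 0.1) for _ in range(H)] for _ in range(N * H)]
bias = [0.0] * (N * H)
out = boom([1.0, -0.5, 0.25, 2.0], weights, bias, N)
print(len(out))  # 4: same size as the input vector
```

A standard feed-forward block would need a second `H x (N*H)` matrix to come back down; here that step is just an addition, which is where the parameter saving comes from.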

Attention models generally work by defining a probability distribution over glimpses of the data given some set of attention outputs from the network.
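As an illustration of "a probability distribution over glimpses", the sketch below assumes the network emits two attention outputs, a centre and a width, which define a discretised Gaussian over positions in a 1-D signal. The glimpse can then be taken softly (an expectation over all positions) or hard (a single sampled position). The parametrisation and all names are assumptions chosen for simplicity.

```python
import math
import random

def glimpse_distribution(n_positions, center, width):
    """Discrete Gaussian over positions, parametrised by attention outputs."""
    scores = [-((i - center) ** 2) / (2.0 * width ** 2)
              for i in range(n_positions)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_glimpse(data, probs):
    """Expected glimpse: every position contributes in proportion to its probability."""
    return sum(p * x for p, x in zip(probs, data))

def hard_glimpse(data, probs, rng):
    """Sampled glimpse: look at exactly one position drawn from the distribution."""
    index = rng.choices(range(len(data)), weights=probs, k=1)[0]
    return index, data[index]

data = [0.1, 0.4, 0.9, 0.3, 0.2, 0.7]
probs = glimpse_distribution(len(data), center=2.0, width=0.8)
print(probs)                      # most mass sits on position 2
print(soft_glimpse(data, probs))  # blended value around position 2
print(hard_glimpse(data, probs, random.Random(0)))
```

The soft version is differentiable and can be trained with ordinary backpropagation, while the sampled version looks at less data per step but generally needs other training techniques such as reinforcement learning.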

There are different types of attention:

  • Self-Attention
  • Selective Attention
  • Introspective Attention
  • Differentiable Visual Attention
  • Associative Attention

Applications of employing attention and memory modules into the network:

  • Transformers, the models that have revolutionised NLP, use attention mechanisms. When introduced in 2017, the Transformer set new benchmarks for machine translation. It also paved the way for models such as BERT, GPT and many more.
  • Handwriting synthesis with RNNs
  • Differentiable Neural Computers
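Memory-augmented models of this kind typically read from memory by content: a key emitted by the network is compared with every memory row, and the similarities are turned into read weights. The sketch below assumes cosine similarity sharpened by a scalar strength and passed through a softmax, which is the general shape of content-based addressing; the function names and the toy memory are illustrative, and writing to memory is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def read_memory(memory, key, strength):
    """Return read weights over memory rows and the weighted read vector."""
    scores = [strength * cosine(row, key) for row in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[d] for w, row in zip(weights, memory))
            for d in range(width)]
    return weights, read

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
key = [1.0, 0.0, 0.0]
soft_weights, _ = read_memory(memory, key, strength=1.0)
sharp_weights, read = read_memory(memory, key, strength=10.0)
print(soft_weights)   # fairly spread out
print(sharp_weights)  # concentrated on the rows most similar to the key
```

Raising the strength makes the read more selective, so the same mechanism can behave like a blurry average or like a near-exact lookup.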

Things To Remember:

  • Selective attention appears to be as useful for deep learning as it is for people
  • We can use attention to attend to memory as well as directly to data
  • Many types of attention mechanism (content, spatial, visual, temporal…) can be defined
  • Attention mechanisms revolutionised language modeling (e.g. Transformers)

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
