There are major differences between how neural networks and our brain function. The biggest difference is memory. In the 1980s, neural networks faced the criticism that they work only with fixed-size inputs and that they cannot bind values to specific locations in a data structure (i.e. store them). The ability to read from and write to memory is critical; both the computer and the brain can do it, but ordinary neural networks cannot. Researchers have identified this difference as a major roadblock preventing today’s AI systems from reaching human-level intelligence.
With this in mind, Alex Graves and fellow researchers at DeepMind aimed to build a differentiable computer: a neural network coupled to external memory. The neural network would act like a CPU, and with memory attached, the aim would be to learn programs (algorithms) from input and output examples.
Working Memory and Neural Networks
It is critical that neural networks remember some information in order to do more meaningful tasks, which led to the creation of recurrent neural networks (RNNs). These kinds of neural networks can process variable-size inputs by adding a time dimension to the data. But RNNs do not really address the whole issue, as they don’t work with external memory. The Neural Turing Machine (NTM) tries to bind values to specific locations by designing a neural network that has an external memory.
It is desirable for intelligent systems to have an external memory attached to them. Research has shown that a model of working memory (also known as short-term memory) could assist neural networks. The brain has a working memory which it uses to fetch and write data. A computer traditionally has a cache, a temporary memory for quickly accessing information, which can be thought of as its working memory.
Working memory is essentially a memory area with limited capacity that is responsible for temporarily holding information available for processing. The research draws on many other works that have studied memory from a computational neuroscience perspective. One example is the research carried out by A. Baddeley and fellow researchers in 2009, which found that a “central executive” in the brain focuses attention and performs operations on data in a memory buffer.
Architecture of Neural Turing Machine
The NTM is built from a neural network, called the controller, and a 2D matrix called the memory bank or memory matrix. The NTM is largely inspired by the Turing machine. A Turing machine, invented by Alan Turing, is a mathematical model of computation that defines an abstract machine. This machine manipulates symbols on a strip of tape according to a table of rules.
The NTM, like a CPU, has the ability to read from and write to memory. Like ordinary neural networks, the controller of the NTM interacts with the external world through input and output vectors. But the special feature of the NTM is that it can also talk to external memory via read and write operations. Following the Turing machine analogy, the decision about which memory addresses the NTM interacts with is made by “heads”. In the image below, the dotted line demarcates which parts of the architecture are “inside” the system as opposed to the outside world.

The memory is indexed (addressed) by row and column. The NTM is trained by an optimisation method such as stochastic gradient descent using backpropagation. The way the NTM learns is that the controller produces weightings (vectors) over memory locations, for which we can calculate gradients.
Theory behind Neural Turing Machines
The NTM uses an attention model to decide where to read from and where to write. It uses the controller output to parameterise a distribution (called a “weighting”) over the rows (locations) in the memory matrix. The weightings are produced by two main attention mechanisms: one based on content and one based on location.
Reading
The memory matrix at time t, denoted Mt, has R rows with C elements per row. An attention mechanism tells the system where to read from: a length-R normalised weight vector wt. The read vector is then the sum of the memory rows, each weighted by the corresponding element of wt.
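Below is a minimal NumPy sketch of this read operation. The shapes follow the R-by-C layout above, but the concrete sizes and variable names are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

# Minimal sketch of the NTM read operation (illustrative shapes and values).
R, C = 128, 20                    # hypothetical memory size: R rows, C columns
M_t = np.random.randn(R, C)       # memory matrix at time t (stand-in values)
w_t = np.random.rand(R)
w_t = w_t / w_t.sum()             # weighting is non-negative and sums to 1

# The read vector is a weighted combination of the memory rows:
# r_t = sum_i w_t[i] * M_t[i]
r_t = w_t @ M_t                   # shape (C,)
```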
Writing
Writing is harder than reading, since it involves two separate steps: erasing, and then adding. To erase old data, a write head uses a new vector, the length-C erase vector et, in addition to the length-R normalised weight vector wt. The erase vector, combined with the weight vector, specifies which elements in a row should be erased; a length-C add vector is then written into memory in the same weighted fashion.
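A minimal sketch of this erase-then-add sequence, using the same illustrative shapes as before (the add vector a_t and all concrete values are placeholders for what the write head would emit):

```python
import numpy as np

R, C = 128, 20
M_prev = np.random.randn(R, C)              # memory from the previous time step
w_t = np.random.rand(R); w_t /= w_t.sum()   # normalised write weighting
e_t = np.random.rand(C)                     # erase vector, elements in [0, 1]
a_t = np.random.randn(C)                    # add vector

# Erase: each element is scaled down where the weighting and erase vector agree.
M_erased = M_prev * (1.0 - np.outer(w_t, e_t))
# Add: the add vector is written into each row in proportion to its weight.
M_t = M_erased + np.outer(w_t, a_t)
```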
Addressing
Producing the weight vectors that decide how to read and write is the difficult part. Addressing happens in stages, and each stage creates an intermediate weight vector that is passed to the next. The first stage’s goal is to generate a weight vector based on how similar each row in memory is to a length-C key vector kt emitted by the controller. This intermediate weight vector is the content weight vector, and it allows the controller to select values similar to previously seen values. This is content-based addressing. The similarity measure used in the NTM is cosine similarity.

Another parameter, a positive scalar called the key strength, determines the concentration of the content weight vector: a larger key strength focuses the weighting more sharply on the rows most similar to the key. This covers content-based addressing; the NTM also provides a mechanism for location-based addressing.
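A rough sketch of content-based addressing with a key strength, assuming cosine similarity followed by a normalised exponential as described above (the function and variable names are my own, not the paper’s code):

```python
import numpy as np

def content_weighting(M_t, k_t, beta_t):
    """Content-based addressing: compare a key against every memory row."""
    eps = 1e-8
    # Cosine similarity between the key k_t and each of the R rows of M_t.
    sims = (M_t @ k_t) / (np.linalg.norm(M_t, axis=1) * np.linalg.norm(k_t) + eps)
    # The key strength beta_t sharpens (large) or flattens (small) the weighting.
    scores = np.exp(beta_t * sims)
    return scores / scores.sum()            # length-R normalised content weighting

R, C = 128, 20
M_t = np.random.randn(R, C)
k_t = np.random.randn(C)                    # key vector emitted by the controller
w_c = content_weighting(M_t, k_t, beta_t=5.0)
```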
Experiments with NTM
The aim of the NTM is to learn programs from input and output data. Based on this, Alex Graves and the other researchers designed several experiments to test the NTM’s capabilities.
Priority sort is one such experiment, carried out to learn how to sort data. Twenty binary vectors were each given a scalar “priority rating” drawn uniformly from the range [-1, 1], and the target was to return the 16 highest-priority vectors in the input. The NTM turned out to be successful: it used the priorities to approximate where each vector should be stored in memory, so that the 16 highest-priority vectors could be produced simply by reading memory locations sequentially.
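For concreteness, here is a rough sketch of how training examples for this task could be generated. The vector length, sizes, and function name are purely illustrative assumptions, not taken from the paper:

```python
import numpy as np

def make_priority_sort_example(n_inputs=20, n_targets=16, vec_len=6):
    # Random binary input vectors, each paired with a priority in [-1, 1].
    vectors = np.random.randint(0, 2, size=(n_inputs, vec_len))
    priorities = np.random.uniform(-1.0, 1.0, size=n_inputs)
    # Target: the n_targets highest-priority vectors, in descending priority order.
    order = np.argsort(-priorities)[:n_targets]
    targets = vectors[order]
    return vectors, priorities, targets
```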
