Introduced in the 1970s, Hopfield networks were popularised by John Hopfield in 1982. For most of machine learning history, they have been sidelined because of their own shortcomings and the arrival of superior architectures such as the transformer (now used in BERT and similar models).
Sepp Hochreiter, co-creator of LSTMs, and a team of researchers have revisited Hopfield networks and come up with surprising conclusions. In a paper titled ‘Hopfield Networks is All You Need’, the authors introduce a couple of elements that make Hopfield networks interchangeable with state-of-the-art transformer models.
What’s New About Hopfield Networks
The above figure depicts the relation between binary modern Hopfield networks, the new Hopfield network with continuous states and a new update rule, and the transformer.
The standard binary Hopfield network has an energy function that can be expressed as the sum of interaction functions F with F(x) = x^2. Modern Hopfield networks, also called “dense associative memory” (DAM) models, use an energy function with interaction functions of the form F(x) = x^n and thereby achieve a storage capacity proportional to d^(n−1), where d is the dimension of the stored patterns.
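To make the role of the interaction function concrete, here is a minimal NumPy sketch that assumes the energy takes the form E = −Σ_i F(x_i · ξ) with F(x) = x^n, where the x_i are the stored patterns and ξ is the current state. The function name `dam_energy` and the toy patterns are illustrative choices, not taken from the paper.

```python
import numpy as np

def dam_energy(patterns, state, n=2):
    """Energy E = -sum_i F(x_i . state) with F(x) = x**n.

    patterns: (N, d) array of +/-1 stored patterns (one per row).
    state:    (d,) array of +/-1.
    n = 2 gives the classical binary Hopfield form (up to a constant
    factor); a larger even n gives a dense associative memory energy.
    """
    overlaps = (patterns @ state).astype(float)
    return float(-np.sum(overlaps ** n))

# Three mutually orthogonal toy patterns in d = 4 dimensions.
patterns = np.array([
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
])
stored = patterns[0]
corrupted = stored.copy()
corrupted[0] *= -1          # flip one bit of the stored pattern

e2_stored = dam_energy(patterns, stored, n=2)        # -16.0
e2_corrupted = dam_energy(patterns, corrupted, n=2)  # -12.0
e4_stored = dam_energy(patterns, stored, n=4)        # -256.0
e4_corrupted = dam_energy(patterns, corrupted, n=4)  # -48.0
print(e2_stored, e2_corrupted, e4_stored, e4_corrupted)
```

In this toy case the stored pattern always has the lower energy, and raising n from 2 to 4 widens the gap between the stored and the corrupted state, which gives some intuition for why steeper interaction functions can keep more patterns apart.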
The main contributions of the paper can be summarised as follows:
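1| Introduction of a new energy function using the log-sum-exp (lse) function: E = −lse(β, X^T ξ) + ½ ξ^T ξ + β^(−1) log N + ½ M^2, where the columns of X are the N stored patterns, β is a scaling parameter, and M is the largest norm of a stored pattern
2| The state ξ is updated by the following new update rule: ξ_new = X softmax(β X^T ξ)
3| The new energy function offers the following: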
- Global convergence to a local minimum
- Exponential storage capacity
- Convergence after one update step
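The properties listed above can be tried out numerically. The following is a minimal NumPy sketch, assuming the energy E = −lse(β, X^T ξ) + ½ ξ^T ξ + β^(−1) log N + ½ M^2 and the update rule ξ_new = X softmax(β X^T ξ) as given in the paper. The function names, the toy patterns, and the choice β = 4 are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def energy(X, xi, beta):
    """E = -lse(beta, X^T xi) + 0.5 xi^T xi + log(N)/beta + 0.5 M^2."""
    N = X.shape[1]                                 # number of stored patterns
    M = np.max(np.linalg.norm(X, axis=0))          # largest pattern norm
    a = beta * (X.T @ xi)
    lse = (np.max(a) + np.log(np.sum(np.exp(a - np.max(a))))) / beta
    return -lse + 0.5 * (xi @ xi) + np.log(N) / beta + 0.5 * M ** 2

def update(X, xi, beta):
    """One step of xi_new = X softmax(beta X^T xi)."""
    return X @ softmax(beta * (X.T @ xi))

# Stored patterns are the columns of X (d = 4, N = 3).
X = np.array([
    [1.0,  1.0,  1.0],
    [1.0, -1.0,  1.0],
    [1.0,  1.0, -1.0],
    [1.0, -1.0, -1.0],
])
beta = 4.0
query = np.array([0.8, 1.1, 0.9, 1.2])   # noisy version of the first pattern

retrieved = update(X, query, beta)
again = update(X, retrieved, beta)

print(np.round(retrieved, 3))
print(energy(X, query, beta), energy(X, retrieved, beta))
```

For these well-separated toy patterns, a single update already lands on the first stored pattern to within numerical precision, the energy drops, and a second update leaves the state essentially unchanged. With patterns that are closer together or a smaller β, the fixed point can instead be a mixture of several stored patterns.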
In this work, the authors have also provided a new PyTorch layer called “Hopfield” which allows equipping deep learning architectures with modern Hopfield networks as new powerful concepts comprising pooling, memory, and attention.
Why Use Them At All
“The modern Hopfield network gives the same results as the SOTA Transformer.”
Hochreiter and his colleagues put modern Hopfield networks to use to find patterns in the immune repertoire of an individual. Their network, called DeepRC, implements what the researchers call ‘a transformer-like mechanism’, which is nothing but a modern Hopfield network.
The re-emergence of the once outdated Hopfield networks has created ripples within the ML community.
On one popular forum, enthusiasts asked why anyone should bother replacing an attention layer with a Hopfield layer. “Am I correct that they are theoretically exactly the same operation, and there is no benefit to switching?”
One of the authors of the original paper responded that there is no reason to replace existing transformer implementations with Hopfield layers. However, the Hopfield layer is more general: one can perform multiple updates, adjust the parameter β, use static queries, and so on. Most importantly, the Hopfield interpretation allows one to gain new insights into the working of transformers, characterised by the kind of fixed points.
Moreover, the Hopfield layer can be integrated flexibly in arbitrary deep network architectures, which the author thinks can open up new possibilities.
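The relationship discussed in this exchange can be illustrated with a short NumPy sketch. It assumes the simplified form Z = softmax(β Q K^T) V, leaves out the learned projection matrices, and shows the three generalisations mentioned in the response: multiple updates, an adjustable β, and a static query. The function `hopfield_retrieve` is a hypothetical helper written for this illustration and is not the API of the authors' PyTorch layer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Standard scaled dot-product attention."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def hopfield_retrieve(Q, K, V, beta, steps=1):
    """Hypothetical sketch: refine the queries in key space for
    (steps - 1) updates, then read out from the values."""
    for _ in range(steps - 1):
        Q = softmax(beta * (Q @ K.T)) @ K
    return softmax(beta * (Q @ K.T)) @ V

rng = np.random.default_rng(0)
d_k, d_v = 8, 5
Q = rng.normal(size=(4, d_k))   # 4 state (query) patterns
K = rng.normal(size=(6, d_k))   # 6 stored (key) patterns
V = rng.normal(size=(6, d_v))

# One update with beta = 1/sqrt(d_k) is the same operation as attention.
one_step = hopfield_retrieve(Q, K, V, beta=1 / np.sqrt(d_k), steps=1)
same = np.allclose(one_step, attention(Q, K, V))

# Generalisations: several updates and a different beta ...
three_steps = hopfield_retrieve(Q, K, V, beta=2.0, steps=3)

# ... or a static query that does not depend on the input, which pools
# the whole set of stored patterns into a single vector.
static_query = rng.normal(size=(1, d_k))
pooled = hopfield_retrieve(static_query, K, V, beta=1 / np.sqrt(d_k))
print(same, three_steps.shape, pooled.shape)
```

With one update and β = 1/√d_k the two functions compute the same thing by construction, which matches the forum observation. The extra knobs only matter when one steps outside that setting.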
Regarding the computational gains with Hopfield networks, the researcher wrote that the Hopfield layer can be seen as a stand-alone module. That said, using a Hopfield layer to replace a pooling layer would require more compute than using it to replace an LSTM layer.
Attention heads lie at the heart of successes such as BERT and other language models. To find that an almost obscure technique such as the Hopfield network is now on par with state-of-the-art models is nothing short of a miracle. The researchers hope that this successful demonstration will encourage others to revisit fundamental methods hiding in plain sight.
Read the original paper here.