One could say that modern AI, or generative AI, runs on attention, that is, on the Transformer architecture created at Google. Seven years after the paper was released, everyone is still searching for a better architecture for AI. But arguably, even after all the backlash, Transformers still reign supreme.
Noam Shazeer, one of the creators of the Transformer, revealed that the architecture was once called ‘CargoNet’, but nobody really paid much attention to the name.
Regardless, researchers challenging Transformers is nothing new. The latest paper from Sepp Hochreiter, the inventor of LSTM, unveils a new LLM architecture with a significant innovation: xLSTM, short for Extended Long Short-Term Memory. The new architecture addresses a major weakness of earlier LSTM designs, which were sequential in nature and unable to process all information at once.
LSTMs, compared to Transformers, are limited by their storage capacities, inability to revise storage decisions, and lack of parallelisability due to memory mixing. Unlike LSTMs, Transformers parallelise operations across tokens, significantly improving efficiency.
The main components of the new architecture are a matrix memory for the LSTM, which eliminates memory mixing, and exponential gating. These modifications allow the LSTM to revise its storage decisions more effectively when processing new data.
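The gating idea can be illustrated with a toy, one-dimensional sketch. In the paper's sLSTM variant, the sigmoid input and forget gates are replaced with exponential gates, paired with a normaliser state that keeps the output bounded. The snippet below is a deliberate simplification, assuming made-up gate pre-activations and an arbitrary initial normaliser, not the paper's full parameterisation:

```python
import math

def slstm_step(c, n, x, i_pre, f_pre):
    """One simplified sLSTM-style cell update with exponential gating.

    c: cell state, n: normaliser state, x: candidate value,
    i_pre / f_pre: gate pre-activations (before exp).
    """
    i = math.exp(i_pre)   # exponential input gate
    f = math.exp(f_pre)   # exponential forget gate
    c = f * c + i * x     # cell update
    n = f * n + i         # normaliser tracks accumulated gate mass
    h = c / n             # normalised output stays bounded
    return c, n, h

# A large input-gate pre-activation lets a new value dominate the memory,
# i.e. the cell can "revise" an earlier storage decision.
c, n = 0.0, 1.0
c, n, h = slstm_step(c, n, x=1.0, i_pre=0.0, f_pre=0.0)   # store 1.0
c, n, h = slstm_step(c, n, x=-1.0, i_pre=5.0, f_pre=0.0)  # strongly revise
print(h)  # close to -1.0: the revision won
```

Because exp() is unbounded, a later gate activation can outweigh everything stored before it, which is exactly the "revising storage decisions" ability that sigmoid-gated LSTMs lack.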
What Are the Problems With Transformers?
In December last year, researchers Albert Gu and Tri Dao from Carnegie Mellon and Together AI introduced Mamba, challenging the prevailing dominance of Transformers.
Their research unveiled Mamba as a state-space model (SSM) that demonstrates superior performance across various modalities, including language, audio, and genomics. In language modelling, for example, the Mamba-3B model outperformed Transformer-based models of the same size and matched Transformers twice its size, both in pretraining and downstream evaluation.
The researchers emphasised Mamba’s efficiency through its selective SSM layer, which addresses the computational inefficiency of Transformers on long sequences, a major limitation, and scales to sequences up to a million tokens long.
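At its core, an SSM layer carries a recurrent hidden state, roughly h_t = a_t * h_(t-1) + b_t * x_t; Mamba's "selective" twist is that these coefficients depend on the input itself, so the layer can decide per token what to keep and what to overwrite. Below is a deliberately minimal one-dimensional sketch of that idea, not Mamba's actual parameterisation; the gate pre-activations passed in stand in for what a real model would compute from the input with a learned projection:

```python
import math

def selective_scan(xs, gate_pre):
    """Toy 1-D selective state-space scan.

    Each step's retention gate a = sigmoid(g) is derived per input
    (in a real model, g would be a learned function of x), making the
    recurrence input-dependent:  h_t = a*h_{t-1} + (1-a)*x_t
    """
    h, out = 0.0, []
    for x, g in zip(xs, gate_pre):
        a = 1.0 / (1.0 + math.exp(-g))  # input-dependent retention gate
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out

# Negative gate pre-activation: overwrite state with the new input.
# Positive gate pre-activation: retain the old state, ignore the input.
print(selective_scan([1.0, 5.0, 9.0], [-10.0, 10.0, -10.0]))
```

Because the whole sequence reduces to one linear scan, this style of model avoids the quadratic token-to-token comparisons that attention performs.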
Another paper, by the Allen Institute for AI, titled “Faith and Fate: Limits of Transformers on Compositionality”, discusses the fundamental limits of Transformer language models by focusing on compositional problems that require multi-step reasoning.
The study investigates three representative compositional tasks: long-form multiplication, logic grid puzzles (e.g. Einstein’s puzzle), and a classic dynamic programming problem.
The study found that Transformers tend to solve such tasks by matching patterns seen in training rather than by systematic multi-step reasoning, and their autoregressive generation lets early errors compound through the rest of the answer. These findings underscore the pressing need for advancements in Transformer architecture and training methods.
Maybe Attention is a Good Start
According to Meta’s AI chief Yann LeCun, “Auto-regressive LLMs are like processes that keep getting away from the correct answers exponentially”.
This is possibly why Meta also introduced MEGALODON, a neural architecture for efficient sequence modelling with unlimited context length. It is designed to address the limitations of Transformer architecture in handling long sequences, including quadratic computational complexity and limited inductive bias for length generalisation.
Google has made a similar move with Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to let the network attend to its own latent representations. This fosters the emergence of working memory within the Transformer and allows it to process indefinitely long sequences.
In April, Google DeepMind also unveiled RecurrentGemma 2B, a new family of open-weight language models based on the novel Griffin architecture.
This architecture achieves fast inference when generating long sequences by replacing global attention with a mixture of local attention and linear recurrences.
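The local-attention half of that mixture is simple to picture: each token attends only to a fixed-size window of recent tokens instead of the whole prefix, so per-token cost no longer grows with sequence length. A small sketch of the causal sliding-window mask (window size and sequence length here are arbitrary choices for illustration):

```python
def local_attention_mask(seq_len, window):
    """Causal sliding-window mask: token i may attend to tokens j with
    i - window < j <= i.  Full causal attention is the window = seq_len case.
    1 = attend, 0 = masked out."""
    return [[1 if i - window < j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

# Each row has at most `window` ones, regardless of sequence length.
for row in local_attention_mask(5, 2):
    print(row)
```

The linear recurrences in Griffin then carry information beyond the window, playing the long-range role that global attention would otherwise serve.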
Speaking of mixtures, Mixture of Experts (MoE) models are also on the rise. MoE is a type of neural network architecture that combines the strengths of multiple smaller models, known as ‘experts’, to make predictions or generate outputs. An MoE model is like a team of hospital specialists: each specialist is an expert in a specific medical field, such as cardiology, neurology, or orthopaedics.
In Transformer models, MoE has two key elements: sparse MoE layers and a gate network. Sparse MoE layers contain the different ‘experts’ within the model, each capable of handling specific tasks, while the gate network functions like a manager, determining which words or tokens are assigned to which expert.
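The "manager" is usually nothing more than a small network producing a score per expert, of which only the top-k are kept and renormalised. A bare-bones sketch of that routing step for a single token (the expert count and gate scores below are illustrative, not from any particular model):

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the top-k experts for one token and renormalise their softmax
    weights, so the token's output is a weighted sum of just k experts."""
    probs = [math.exp(g) for g in gate_logits]
    total = sum(probs)
    probs = [p / total for p in probs]            # softmax over experts
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}         # expert index -> weight

# One token, four experts: only two experts actually run for this token.
weights = top_k_route([2.0, 0.5, 1.5, -1.0], k=2)
print(weights)  # keys are the two chosen expert indices
```

This is why MoE decouples model capacity from per-token compute: every expert's parameters exist, but each token only exercises k of them.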
This thinking has led to Jamba. AI21 Labs’ Jamba is a hybrid decoder architecture that combines Transformer layers with Mamba layers, along with an MoE module. The company refers to this combination of three elements as a Jamba block.
Jamba applies MoE at every other layer, with 16 experts, routing each token to the top two. “The more the MoE layers, and the more the experts in each MoE layer, the larger the total number of model parameters,” wrote AI21 Labs in Jamba’s research paper.
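That trade-off is easy to see with back-of-the-envelope arithmetic: an MoE layer stores the parameters of every expert, but each token only pays the compute cost of the two experts it is routed to. The expert size below is made up purely for illustration; only the expert count and top-2 routing come from Jamba:

```python
# Hypothetical sizes, just to contrast total vs active parameters.
num_experts = 16                  # experts per MoE layer (as in Jamba)
top_k = 2                         # experts used per token (as in Jamba)
params_per_expert = 100_000_000   # made-up expert size

total_params = num_experts * params_per_expert   # stored in memory
active_params = top_k * params_per_expert        # used per token

print(f"total:  {total_params:,}")   # grows with the number of experts
print(f"active: {active_params:,}")  # fixed by top-k, not expert count
```

Adding experts inflates the first number while leaving the second untouched, which is exactly the scaling behaviour the AI21 quote describes.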
The End of Transformers?
Before Transformers were all the hype, people were obsessed with recurrent neural networks (RNNs) for deep learning. But RNNs, by definition, process data sequentially, which made them hard to parallelise and, it was thought, an unfit choice for text-based models at scale.
But Transformers themselves grew out of RNNs: attention was first bolted onto recurrent sequence models before the recurrence was dropped altogether. The same could hold for whatever “replaces” Transformers.
At NVIDIA GTC 2024, Jensen Huang asked the panel about the most significant improvements to the base Transformer design. Aidan Gomez replied that extensive work has been done on the inference side to speed up these models. However, Gomez said that he is quite unhappy that all developments happening today are built on top of Transformers.
“I still think it kind of disturbs me how similar to the original form we are. I think the world needs something better than the Transformer,” he said, adding that he hopes it will be succeeded by a ‘new plateau of performance’. “I think it is too similar to the thing that was there six or seven years ago.”