The machine learning community this week was offered a crash course on how to write a paper by Stephen Merity (fondly called Smerity), a deep learning researcher and a Harvard graduate, via his paper on recurrent neural networks.
“Stop Thinking With Your Head”, announced Smerity in the title of his work, preparing the readers for a fun-filled ride.
In this work, the author investigates the current state of natural language processing, the models being used and the alternative approaches. In the process, he tears down the conventional methods from top to bottom, including their etymology.
He not only critiques the existing methods but also provides a simple yet effective way to train models using minimal resources.
The model, named SHA-RNN (single-headed attention RNN), is composed of an RNN, pointer-based attention, and a “Boom” feed-forward layer with a sprinkling of layer normalization. The persistent state is the RNN’s hidden state as well as the memory concatenated from previous memories.
With this work, the author tries to answer two questions:
- How many attention heads does one need, and why is there so much talk about the attention layer?
- What are the benefits of attention heads?
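For readers unfamiliar with the term, a single attention head can be sketched as below. This is a generic scaled dot-product attention over a small memory, written as an illustration only; it is an assumption that this textbook form is close enough to convey the idea, and it is not the paper's exact formulation. The function name and the toy memory are hypothetical.

```python
import math

def single_head_attention(query, keys, values):
    """Generic scaled dot-product attention with one head (illustrative only).

    query: list of d floats.
    keys, values: one list of floats per memory slot.
    Returns (context vector, attention weights).
    """
    d = len(query)
    # Similarity of the query to every stored key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the memory slots.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted average of the values.
    width = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return context, weights

# Hypothetical toy memory with three slots.
memory_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
context, weights = single_head_attention([1.0, 0.0], memory_keys, memory_values)
print(weights)
```

A multi-head model would run several such heads in parallel with separate projections; the paper's question is whether one is enough.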
Overview Of SHA-RNN
The model consists of a trainable embedding layer, one or more layers of a stacked single head attention recurrent neural network (SHA-RNN), and a softmax classifier.
Language modelling is one of the foundational tasks of natural language processing. The task involves predicting the (n + 1)th token in a sequence given the n preceding tokens.
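As a minimal sketch of that task, the snippet below turns a short byte sequence into (context, target) pairs, where each target is the token that follows its context. The helper name `next_token_examples` is hypothetical, and working on raw bytes is an assumption made for simplicity.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: predict token n+1 from the first n tokens."""
    return [(tokens[:n], tokens[n]) for n in range(1, len(tokens))]

pairs = next_token_examples(list(b"boom"))
for context, target in pairs:
    print(bytes(context), "->", bytes([target]))
```

A language model is trained to assign high probability to each target given its context.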
Many of the complexities and processes in the world can be rewritten into a language. The model architecture is an upgrade of the AWD-LSTM.
Almost any wordpiece tokenization will, for example, split prefixes and suffixes from various words and result in compositionally more balanced word fragments. Mixed with teacher forcing, which is present at both training time and test time, this could have quite a profound impact.
The model uses a single head of attention and a modified feed-forward layer similar to that in a Transformer, which is referred to as a Boom layer. The Boom layer takes a vector from small (1024) to big (4096) and back to small (1024).
The Boom layer is strongly related to the large feed-forward layer found in Transformers and other architectures.
This layer minimizes computation and removes an entire matrix of parameters compared to traditional down-projection layers.
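One way to picture how a layer can go from small to big to small without a down-projection matrix is sketched below: expand the vector with a single matrix, apply a nonlinearity, then split the result into equal chunks and add them together. The GELU activation and the chunk-and-sum step are assumptions based on a reading of the paper, not details given in this article, and the names `boom`, `w_up` and `b_up` are hypothetical. The sizes are toy values standing in for 1024 and 4096.

```python
import math
import random

def gelu(x):
    # Exact GELU via the Gaussian error function (assumed activation).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def boom(v, w_up, b_up, n_chunks):
    """Expand v (length H) to length N*H, apply GELU, then sum the N chunks of length H."""
    h = len(v)
    # Up-projection: the only weight matrix in this sketch (N*H rows, H columns).
    u = [gelu(sum(w * x for w, x in zip(row, v)) + b) for row, b in zip(w_up, b_up)]
    assert len(u) == n_chunks * h
    # Instead of a second (down-projection) matrix, add the N chunks elementwise.
    return [sum(u[c * h + i] for c in range(n_chunks)) for i in range(h)]

H, N = 4, 4  # toy sizes; 1024 -> 4096 -> 1024 corresponds to N = 4
rng = random.Random(0)
w_up = [[rng.gauss(0.0, 0.1) for _ in range(H)] for _ in range(N * H)]
b_up = [0.0] * (N * H)
out = boom([1.0, -2.0, 0.5, 3.0], w_up, b_up, N)
print(len(out))  # 4
```

Under these assumptions the layer stores one H-by-4H matrix rather than two, which is where the parameter saving would come from.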
“The code is a tad horrific. I’ll admit to that,” the author concedes.
All the work was done on a single GPU. “Because I didn’t want my dollars to leave the bank,” quipped the author.
The author also explained the rationale behind his decision to go with a single GPU, confessing that all his best work came from having limited resources. He also believes that keeping an open mind is the right strategy over the long run of any project.
For the experiments done to validate the model, the author used the Hutter Prize Wikipedia dataset (Hutter, 2018), also known as enwik8. This is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. The train, validation and test sets consist of the first 90M, the next 5M, and the final 5M characters, respectively.
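The split described above amounts to simple slicing. The sketch below is a hedged illustration on a stand-in byte string rather than the real file; the helper name `split_enwik8` and its default sizes are assumptions chosen to match the 90M/5M/5M figures.

```python
def split_enwik8(data, n_valid=5_000_000, n_test=5_000_000):
    """Split a byte string into train/valid/test, taking valid and test from the end."""
    n_train = len(data) - n_valid - n_test
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test

# Stand-in for the real 100,000,000-byte dump, scaled down to 100 bytes.
fake = bytes(range(100))
train, valid, test = split_enwik8(fake, n_valid=5, n_test=5)
print(len(train), len(valid), len(test))  # 90 5 5
```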
Rocking The Boat Of Buzzwords
Right from the beginning of the paper, it is obvious that the author is not impressed with the way acronyms are dished out across the machine learning community.
Explaining why he named it the Boom layer, Smerity says it is really not that hard to visualize:
“Use your hands if you need to whilst shouting ‘boooOOOOmmm’.”
Yes, it’s as straightforward as it sounds.
Language holds clues to the way the mind works, and since machine learning attempts to crack that code, it is of utmost importance that language be studied from every possible angle rather than settling for a few working models.
The author, in his paper, complains that by putting more work into inflexibly defining a machine learning model, we’ve accidentally handicapped it.
The author’s lone goal is to show that the entire field might have evolved in a different direction if we had instead been obsessed with a slightly different acronym and a slightly different result.
Key Takeaways From The Paper
- The author argues that the field might have evolved in a different direction if it had instead been obsessed with a slightly different acronym and a slightly different result.
- The final results are achievable in plus or minus 24 hours on a single GPU.
- The attention mechanism is also readily extended to large contexts with minimal computation.
- It’s okay to make research more fun while being informative. This will scare away fewer people from the core research topics.
In all seriousness, the author voices the need for a Moore’s Law for machine learning that encourages a minicomputer future, not a mainframe one. He also says that he plans to rebuild the codebase from the ground up, both as an educational tool for others and as a strong platform for future work in academia and industry.