AI Researcher Smerity Rips Apart Traditions, Coins The Term ‘Boooom Layer’ Because Why Not

The machine learning community this week got a crash course in how to write a paper from Stephen Merity (fondly known as Smerity), a deep learning researcher and Harvard graduate, via his new paper on recurrent neural networks.

“Stop Thinking With Your Head”, announces Smerity in the very title of his work, “Single Headed Attention RNN: Stop Thinking With Your Head”, preparing readers for a fun-filled ride.

In this work, the author investigates the current state of natural language processing, the models in use and alternative approaches. In the process, he tears down conventional methods from top to bottom, etymology included.

He not only critiques existing methods but also offers a simple yet effective way to train models using minimal resources.

The model, named SHA-RNN or single-headed attention RNN, is composed of an RNN, pointer-based attention, and a “Boom” feed-forward layer with a sprinkling of layer normalization. The persistent state is the RNN’s hidden state as well as a memory concatenated from previous memories.

With this work, the author tries to answer two questions:

  1. How many attention heads does one need, and why is there so much talk about attention layers?
  2. What are the benefits of attention heads?

Overview Of SHA-RNN

(Architecture diagram of SHA-RNN, via Smerity)

The model consists of a trainable embedding layer, one or more layers of a stacked single head attention recurrent neural network (SHA-RNN), and a softmax classifier.

Language modelling is one of the foundational tasks of natural language processing. The task involves predicting the (n + 1)th token in a sequence given the n preceding tokens.
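As a toy illustration of this next-token prediction task, here is a count-based bigram model, a minimal sketch that has nothing to do with the paper's neural approach (the function names and the example sentence are invented for illustration):

```python
from collections import Counter, defaultdict

def train_bigram_lm(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Predict the (n + 1)th token given only the nth token."""
    return counts[prev].most_common(1)[0][0]

model = train_bigram_lm("the cat sat on the mat".split())
print(predict_next(model, "cat"))  # -> "sat"
```

A real language model conditions on all n preceding tokens rather than just the last one, but the prediction interface is the same.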

Many of the complexities and processes in the world can be rewritten into a language. The model architecture is an upgrade of the AWD-LSTM.

Almost any word-piece tokenization will split prefixes and suffixes from words, for example, and result in compositionally more balanced word fragments. Mixed with teacher forcing, which is present at both training time and test time, this could have quite a profound impact.
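To see what such splitting looks like, here is a toy greedy longest-match-first splitter in the style of WordPiece, with a made-up vocabulary; it is an illustrative sketch, not any particular library's implementation:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first tokenization (WordPiece-style).
    Continuation pieces inside a word are marked with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark non-initial fragments
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece("playing", vocab))   # -> ['play', '##ing']
print(wordpiece("unplayed", vocab))  # -> ['un', '##play', '##ed']
```

Note how the suffix "##ing" and prefix "un" get split off, yielding the compositionally balanced fragments the author describes.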

The model uses a single head of attention and a modified feed-forward layer similar to that in a Transformer, referred to as a Boom layer. The Boom layer takes a vector from small (1024) to big (4096) and back to small (1024).

The Boom layer is related strongly to the large feed-forward layer found in Transformers and other architectures. 

This layer reduces computation and removes an entire matrix of parameters compared to traditional down-projection layers.
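A rough pure-Python sketch of the idea, assuming the chunk-and-sum trick the paper describes: one matrix projects up, and the way back down is a parameter-free elementwise sum of chunks. The `boom` function and its toy weights are illustrative, not the author's actual PyTorch code:

```python
import math

def gelu(x):
    """GELU activation (tanh approximation), as used in Transformer feed-forward layers."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def boom(v, W, n_chunks=4):
    """Project v (dim H) up to H * n_chunks with a single matrix W,
    apply a nonlinearity, then sum the chunks back down to dim H.
    No second (down-projection) matrix is needed."""
    H = len(v)
    assert len(W) == H * n_chunks  # one row of W per up-projected dimension
    big = [gelu(sum(W[i][j] * v[j] for j in range(H))) for i in range(H * n_chunks)]
    # Split the big vector into n_chunks slices of size H and sum them elementwise
    return [sum(big[k * H + i] for k in range(n_chunks)) for i in range(H)]

H, N = 2, 4  # stand-ins for the paper's 1024 and 4096 // 1024
W = [[0.1] * H for _ in range(H * N)]  # toy weights for illustration
out = boom([1.0, 2.0], W, n_chunks=N)
print(len(out))  # -> 2, back to the small dimension
```

With H = 1024 and n_chunks = 4 this matches the 1024 → 4096 → 1024 shape described above, while saving the 4096 × 1024 down-projection matrix a standard Transformer feed-forward block would carry.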

“The code is a tad horrific. I’ll admit to that,” the author confesses in the paper.

All the work was done on a single GPU. “Because I didn’t want my dollars to leave the bank,” quipped the author.

The author also explained the rationale behind his decision to stick with a single GPU, confessing that all his best work has come with limited resources. He also believes that keeping an open mind is the right strategy over the long run of any project.

For the experiments done to validate the model, the author used The Hutter Prize Wikipedia dataset (Hutter, 2018), also known as enwik8. This dataset is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. The train, validation and test sets consist of the first 90M, 5M, and 5M characters, respectively.
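The conventional enwik8 partitioning is simple byte slicing; the snippet below is a minimal illustration with a dummy byte stream (real experiments would read the actual enwik8 file, which is the first 100 million bytes of a Wikipedia XML dump):

```python
def split_enwik8(data: bytes):
    """Split the 100M-byte enwik8 stream into the conventional
    90M/5M/5M train/validation/test partitions."""
    M = 1_000_000
    return data[:90 * M], data[90 * M:95 * M], data[95 * M:]

# Illustration with a dummy 100M-byte stream
train, valid, test = split_enwik8(bytes(100 * 1_000_000))
print(len(train), len(valid), len(test))  # -> 90000000 5000000 5000000
```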

Rocking The Boat Of Buzzwords

Right from the beginning of the paper, it is obvious that the author is not impressed with the way acronyms are dished out across the machine learning community.

Explaining the idea behind naming it the Boom layer, Smerity reveals that it’s really not that hard to visualize:

Use your hands if you need to whilst shouting “boooOOOOmmm”

Yes, it’s as straightforward as it sounds.

Language holds clues to the way the mind works, and since machine learning attempts to crack that code, it is of utmost importance that language be studied in every possible way rather than settling for a few working models.

The author, in his paper, complains that by putting more work into inflexibly defining a machine learning model, we’ve accidentally handicapped it.

The author’s lone goal is to show that the entire field might have evolved in a different direction if we had instead been obsessed with a slightly different acronym and a slightly different result.

Key Takeaways From The Paper

  • The author shows that machine learning (NLU in particular) would have evolved in a different direction had we instead been obsessed with a slightly different acronym.
  • The final results are achievable in plus or minus 24 hours on a single GPU.
  • The attention mechanism is also readily extended to large contexts with minimal computation.
  • It’s okay to make research more fun while being informative. This will scare away fewer people from the core research topics.

In all seriousness, the author voices the need for a Moore’s Law for machine learning that encourages a minicomputer future, not a mainframe one. He also assures that he plans on rebuilding the codebase from the ground up both as an educational tool for others and as a strong platform for future work in academia and industry.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
