
Google’s New NLP Model Achieves BERT-Level Performance Using Few Parameters


Recently, the researchers at Google AI unveiled an extension of the projection attention neural network PRADO, known as pQRNN. According to the researchers, this new extension advances the state of the art in NLP performance while keeping the model size minimal.

Long-text classification is one of the fundamental tasks in Natural Language Processing (NLP). The pQRNN model is able to achieve BERT-level performance on a text classification task using orders of magnitude fewer parameters.

Prabhu Kaliamoorthi, Software Engineer at Google Research, stated in a blog post that over the last decade, natural language processing (NLP) and other speech applications have been significantly transformed by deep learning methods. He added, “However, issues such as preserving user privacy, eliminating network latency, enabling offline functionality, and reducing operation costs have rapidly spurred the development of the NLP models that can be run on-device rather than in data centres.”

Also, devices like smartphones and smartwatches have limited memory and low computational capacity, which requires the ML models running on them to be small and efficient, without compromising the quality of the model.

Developed last year, the projection attention neural network PRADO is a combination of trainable projections with attention and convolutions. Using this model, the researchers trained tiny neural networks just 200 kilobytes in size that improve over prior CNN and LSTM models, and achieved near state-of-the-art performance on multiple long-document classification tasks.

According to the researchers, while most models use a fixed number of parameters per token, the PRADO model uses a network structure that requires extremely few parameters to learn the tokens most relevant or useful to the task.

Behind The pQRNN PRADO Extension 

The PRADO model was designed to learn clusters of text segments from words rather than word pieces or characters, which enabled the NLP model to perform well on low-complexity NLP tasks. “Building on the success of PRADO, we developed an improved NLP model, called pQRNN,” said Kaliamoorthi.

The pQRNN PRADO extension model is mainly composed of three building blocks:

  1. A projection operator
  2. A dense bottleneck layer, and
  3. A stack of QRNN encoders

Projection Operator: The projection converts tokens in the text to a sequence of ternary vectors. According to researchers, the implementation of the projection layer in pQRNN is identical to that used in PRADO. It helps the model learn the most relevant tokens without a fixed set of parameters to define them.  
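
As an illustration, a minimal sketch of such a projection is shown below; this is not the PRADO/pQRNN implementation, and the vector dimension and hashing scheme are purely illustrative assumptions.

```python
# Hash-based ternary projection sketch: each token is deterministically
# mapped to a vector with entries in {-1, 0, +1}, so no embedding table
# has to be stored on-device. Dimension and hashing are assumptions.
import hashlib
import numpy as np

def ternary_projection(token: str, dim: int = 128) -> np.ndarray:
    """Map a token string to a ternary feature vector of length `dim`."""
    vec = np.empty(dim, dtype=np.int8)
    for i in range(dim):
        digest = hashlib.md5(f"{token}:{i}".encode()).digest()
        vec[i] = int.from_bytes(digest[:4], "little") % 3 - 1  # -1, 0 or +1
    return vec

tokens = "the movie was surprisingly good".split()
features = np.stack([ternary_projection(t) for t in tokens])
print(features.shape)  # (5, 128): one ternary vector per token
```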

Dense Bottleneck Layer: The dense bottleneck layer allows the network to learn a per-word representation that is relevant for the task at hand.
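
A minimal sketch of this idea, assuming a single shared dense layer with illustrative sizes (128 → 64) and a tanh nonlinearity, could look like this:

```python
# Bottleneck sketch: one small dense layer, shared across positions, maps
# each ternary feature vector to a learnable per-word representation.
# The weights are random here; in a real model they would be trained.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(128, 64))  # trainable weight matrix
b = np.zeros(64)                           # trainable bias

def bottleneck(ternary_features: np.ndarray) -> np.ndarray:
    """(seq_len, 128) ternary features -> (seq_len, 64) per-word vectors."""
    return np.tanh(ternary_features @ W + b)

dummy = rng.choice([-1.0, 0.0, 1.0], size=(5, 128))
print(bottleneck(dummy).shape)  # (5, 64)
```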

Stack of QRNN Encoders: According to the researchers, the representation resulting from the bottleneck layer is not capable of taking the context of the word into account. This is where QRNN encoders come into play. The researchers learned the contextual representation by using a stack of bidirectional QRNN encoders.
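
The core of a QRNN encoder is a simple gated pooling recurrence (from the original QRNN work by Bradbury et al.); running it in both directions and concatenating the results gives each token a context-aware vector. The sketch below uses random placeholder inputs in place of the convolutional gate computation, which is an assumption made for brevity:

```python
# QRNN-style f-pooling sketch: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
# In a full QRNN, the candidates z and forget gates f come from
# convolutions over the sequence; here they are random placeholders.
import numpy as np

def qrnn_f_pool(z: np.ndarray, f: np.ndarray) -> np.ndarray:
    """z, f: (seq_len, hidden), f in (0, 1). Returns hidden states."""
    h = np.zeros_like(z)
    prev = np.zeros(z.shape[1])
    for t in range(z.shape[0]):
        prev = f[t] * prev + (1.0 - f[t]) * z[t]
        h[t] = prev
    return h

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 64))                          # candidate vectors
f = 1.0 / (1.0 + np.exp(-rng.normal(size=(5, 64))))   # forget gates in (0, 1)

forward = qrnn_f_pool(z, f)
backward = qrnn_f_pool(z[::-1], f[::-1])[::-1]        # same pooling, right to left
contextual = np.concatenate([forward, backward], axis=-1)
print(contextual.shape)  # (5, 128): context-aware vector per token
```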

pQRNN vs BERT

Kaliamoorthi said that the combination of the three building blocks resulted in a network that is capable of learning a contextual representation from just text input without employing any preprocessing.
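
A toy forward pass wiring the three blocks together is sketched below: raw text goes in and class scores come out, with no separate preprocessing pipeline. Every dimension, the hashing, the single forward-only encoder and the mean-pooled linear classifier are illustrative assumptions, not the published pQRNN architecture.

```python
# End-to-end sketch: projection -> bottleneck -> QRNN-style pooling -> classifier.
import hashlib
import numpy as np

def project(token, dim=64):
    return np.array([int.from_bytes(hashlib.md5(f"{token}:{i}".encode())
                                    .digest()[:4], "little") % 3 - 1
                     for i in range(dim)], dtype=np.float64)

rng = np.random.default_rng(0)
W_bottleneck = rng.normal(scale=0.1, size=(64, 32))
W_gate = rng.normal(scale=0.1, size=(32, 32))
W_cls = rng.normal(scale=0.1, size=(32, 2))        # e.g. toxic vs non-toxic

def pqrnn_like_forward(text: str) -> np.ndarray:
    feats = np.stack([project(t) for t in text.split()])  # 1. projection
    x = np.tanh(feats @ W_bottleneck)                      # 2. bottleneck
    f = 1.0 / (1.0 + np.exp(-(x @ W_gate)))                # forget gates
    h, prev = np.zeros_like(x), np.zeros(x.shape[1])
    for t in range(x.shape[0]):                            # 3. QRNN-style pooling
        prev = f[t] * prev + (1.0 - f[t]) * x[t]
        h[t] = prev
    return h.mean(axis=0) @ W_cls                          # pooled class scores

print(pqrnn_like_forward("a genuinely thoughtful comment"))
```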

The researchers evaluated pQRNN on the civil_comments dataset and compared it with the BERT model on the same task. According to them, the public pre-trained version of BERT performed poorly on the task, so the comparison was made to a BERT version pre-trained on several different relevant multilingual data sources to achieve the best possible performance. Although pQRNN is much smaller in size than BERT, it achieved BERT-level performance.

Wrapping Up

The main idea behind developing all these efficient models is to create embedding-free models that minimise the model size while keeping computational requirements low and preserving the quality of the model. The researchers have open-sourced the PRADO model to stimulate further research in this area and encourage the community to use it as a jumping-off point for new model architectures.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.