Recently, researchers at Google AI unveiled pQRNN, an extension of the projection attention neural network PRADO. According to the researchers, this new extension advances the state of the art in NLP performance with minimal model size.
Long text classification is one of the fundamental tasks in Natural Language Processing (NLP). The pQRNN model is able to achieve BERT-level performance on a text classification task with orders of magnitude fewer parameters.
Prabhu Kaliamoorthi, Software Engineer at Google Research, stated in a blog post that over the last decade, techniques like natural language processing (NLP) and other speech applications had been significantly transformed by using deep learning methods. He added, “However, issues such as preserving user privacy, eliminating network latency, enabling offline functionality, and reducing operation costs have rapidly spurred the development of the NLP models that can be run on-device rather than in data centres.”
Also, devices like smartphones and smartwatches have limited memory and low computational capacity, so the ML models running on them must be small and efficient without compromising quality.
Developed last year, the projection attention neural network PRADO combines trainable projections with attention and convolutions. Using this model, the researchers trained tiny neural networks just 200 kilobytes in size that improved over prior CNN and LSTM models and achieved near state-of-the-art performance on multiple long-document classification tasks.
According to the researchers, while most models use a fixed number of parameters per token, the PRADO model uses a network structure that requires extremely few parameters to learn the tokens most relevant to the task.
Behind The pQRNN PRADO Extension
The PRADO model was designed to learn clusters of text segments from words rather than word pieces or characters, which enabled the NLP model to achieve strong performance on low-complexity NLP tasks. “Building on the success of PRADO, we developed an improved NLP model, called pQRNN,” said Kaliamoorthi.
The pQRNN model is mainly composed of three building blocks:
- A projection operator
- A dense bottleneck layer, and
- A stack of QRNN encoders
Projection Operator: The projection converts tokens in the text to a sequence of ternary vectors. According to researchers, the implementation of the projection layer in pQRNN is identical to that used in PRADO. It helps the model learn the most relevant tokens without a fixed set of parameters to define them.
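To make the idea concrete, here is a minimal sketch of a hash-based ternary projection in NumPy. This is an illustration of the general technique, not Google's implementation (the actual PRADO/pQRNN projection uses efficient bit-level operations), but it shows the key property: the vector for each token is derived from a hash of its text, so no trainable embedding table of per-token parameters is needed.

```python
import hashlib
import numpy as np

def ternary_projection(token: str, dim: int = 128) -> np.ndarray:
    """Map a token to a pseudo-random ternary vector in {-1, 0, +1}^dim.

    Illustrative sketch only: the vector is a deterministic function of a
    hash of the token, so identical tokens always project to the same
    vector and no embedding table has to be stored or trained.
    """
    # Derive a deterministic seed from the token text.
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    # Each component is drawn from {-1, 0, +1}.
    return rng.integers(-1, 2, size=dim).astype(np.int8)

# A sentence becomes a sequence of ternary vectors, one per token.
sequence = np.stack([ternary_projection(t) for t in "on device nlp".split()])
```

Because the projection is stateless, the model's size does not grow with vocabulary size, which is a large part of how pQRNN stays so small.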
Dense Bottleneck Layer: The dense bottleneck layer allows the network to learn a per word representation that is relevant for the task at hand.
Stack of QRNN Encoders: According to the researchers, the representation resulting from the bottleneck layer is not capable of taking the context of the word into account. This is where QRNN encoders come into play. The researchers learned the contextual representation by using a stack of bidirectional QRNN encoders.
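As a rough illustration of the encoder stage, the core of a QRNN layer is "fo-pooling": candidate vectors and gates are produced by convolutions over the sequence (omitted here), and a cheap element-wise recurrence mixes them across time. The sketch below, in plain NumPy, is a simplified illustration rather than the pQRNN implementation; in the actual model the inputs would come from the dense bottleneck layer, and several such bidirectional layers are stacked.

```python
import numpy as np

def qrnn_fo_pool(z, f, o):
    """QRNN 'fo-pooling' over a sequence.

    z: candidate vectors, shape (timesteps, hidden)
    f: forget gates in (0, 1), same shape
    o: output gates in (0, 1), same shape

    In a real QRNN, z, f and o are computed by convolutions over the input
    sequence; here they are taken as given. The recurrence itself contains
    no matrix multiplications, which is what makes QRNN layers fast.
    """
    c = np.zeros_like(z[0])
    hidden = []
    for t in range(z.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * z[t]   # blend old state with new candidate
        hidden.append(o[t] * c)              # gate the exposed output
    return np.stack(hidden)

def bidirectional_qrnn(z, f, o):
    """Run fo-pooling forward and backward and concatenate the features,
    giving each position context from both directions."""
    fwd = qrnn_fo_pool(z, f, o)
    bwd = qrnn_fo_pool(z[::-1], f[::-1], o[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)
```

Feeding the per-word bottleneck representations through such a bidirectional stack is what turns context-free word vectors into contextual ones.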
pQRNN vs BERT
Kaliamoorthi said that the combination of the three building blocks resulted in a network that is capable of learning a contextual representation from just text input without employing any preprocessing.
The researchers evaluated pQRNN on the civil_comments dataset and compared it with the BERT model on the same task. According to them, the public pre-trained version of BERT performed poorly on the task, so the comparison was made to a BERT version pre-trained on several different relevant multilingual data sources to achieve the best possible performance. Although pQRNN is much smaller than BERT, it achieved BERT-level performance.
Wrapping Up
The main idea behind developing all these efficient models is to create embedding-free models that minimise model size without compromising the quality of the model. The researchers have open-sourced the PRADO model to stimulate further research in this area and encourage the community to use it as a jumping-off point for new model architectures.