Recently, researchers at Amazon used neural architecture search to extract an optimal subset of the popular BERT architecture. This smaller version of BERT, known as BORT, can be pre-trained in 288 GPU hours, which is 1.2% of the time required to pre-train RoBERTa-large, the highest-performing variant of the BERT parametric architectural family.
Since its inception, BERT has achieved groundbreaking results on several tasks in natural language processing (NLP) and natural language understanding (NLU). It has made a resounding impact in the area of language modelling as well.
However, the usability of BERT has repeatedly been questioned over serious concerns such as its large size, slow inference time and complex pre-training process, among others.
This is why finding a high-performing compressed BERT architecture has been an active area of research since the original article was released. Researchers have been trying to extract a simpler sub-architecture of this language model that maintains performance similar to its predecessor's while simplifying the pre-training process and reducing inference time, with varying degrees of success. The most prominent studies in this direction include TinyBERT, DistilBERT and BERT-of-Theseus, among others.
But many such attempts fall short because the sub-architecture's performance is still overshadowed by the original implementation in terms of accuracy. Moreover, the choice of architectural parameters in these works often appears to be arbitrary.
BORT is an optimal sub-architecture extracted from a high-performing BERT variant; it is 16% the size of BERT-large and performs inference eight times faster on a CPU. To extract this subset of BERT, the researchers used an approximation algorithm known as a fully polynomial-time approximation scheme (FPTAS). According to the researchers, under certain conditions, this algorithm can efficiently extract such a set with optimal guarantees.
The researchers considered the problem of extracting the set of architectural parameters for BERT that is optimal over three metrics, which are inference latency, parameter size and error rate.
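The paper's FPTAS is considerably more involved, but the underlying selection problem can be illustrated with a minimal, hypothetical sketch: enumerate candidate architectural parameter sets and keep only those that are not dominated on all three metrics (inference latency, parameter size, error rate). The candidate space, cost formulas and error values below are made up for illustration and are not the paper's actual surrogates.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    """A toy architectural parameter set (hypothetical, not BORT's real space)."""
    depth: int   # number of transformer layers
    hidden: int  # hidden size
    heads: int   # attention heads

    # Stand-in cost models -- real work would measure these empirically.
    def params(self) -> int:
        return 12 * self.depth * self.hidden * self.hidden

    def latency(self) -> float:
        return self.depth * self.hidden / 1000.0

def pareto_front(cands, error):
    """Keep candidates not dominated on (latency, params, error)."""
    def dominates(a, b):
        ka = (a.latency(), a.params(), error[a])
        kb = (b.latency(), b.params(), error[b])
        return all(x <= y for x, y in zip(ka, kb)) and ka != kb
    return [c for c in cands if not any(dominates(o, c) for o in cands)]

# Enumerate a small toy search space and attach fake error rates
# (here simply assuming bigger models err less).
space = [Candidate(d, h, a)
         for d, h, a in product([2, 4, 8], [256, 512], [4, 8])]
error = {c: 1.0 / (c.depth * c.hidden) for c in space}
front = pareto_front(space, error)
```

A brute-force Pareto scan like this is exponential in the number of architectural choices; the point of the FPTAS in the paper is to approximate the optimum in polynomial time with provable guarantees.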
BORT is closely related to a specific variation of the RoBERTa architecture, as its maximum point and vocabulary are based on RoBERTa. The researchers stated that, with regard to the characterisation of its architectural parameter set, BORT is fairly similar to other compressed variants of the BERT architecture. They noted that the most intriguing fact is that the depth of the network is D = 4 for all but one of the models, which provides a good empirical check of their experimental setup.
For training the model, the researchers combined corpora obtained from Wikipedia, Wiktionary, OpenWebText (Gokaslan and Cohen, 2019), UrbanDictionary, One Billion Words (Chelba et al., 2014), the news subset of Common Crawl (Nagel, 2016) and BookCorpus. This was due to the requirement of having a sufficiently diverse dataset to pre-train BORT.
The researchers further evaluated BORT on popular public NLU benchmarks such as GLUE, SuperGLUE and Reading Comprehension from Examinations (RACE). BORT obtained significant improvements over BERT-large on all of them.
The contributions made in this research are mentioned below:
- The researchers considered the problem of extracting the set of architectural parameters for BERT that is optimal over three metrics, which are inference latency, parameter size, and error rate.
- They extracted an optimal sub-architecture from a high-performing BERT variant, known as BORT.
- BORT is 16 per cent the size of BERT-large and performs inference eight times faster on a CPU.
- According to the researchers, the time required to pre-train BORT is remarkably improved with respect to its original counterpart.
- BORT was also evaluated on popular benchmarks like GLUE, SuperGLUE and RACE, and achieved significant improvements over BERT-large on all of them.
Compared to the training time for BERT-large, which holds the world record of 1,153 GPU hours on the same hardware but with a dataset ten times smaller, and RoBERTa-large, which required 25,764 GPU hours with a slightly larger dataset, BORT remains more efficient than these two popular models. The researchers caution, however, that this comparison is imprecise, as the deep learning frameworks used to train these models differ, although the same GPU model was used across the board.
BORT is smaller, faster and more efficient to pre-train, and is able to outperform nearly every other member of the family across a wide variety of NLU tasks. The researchers concluded that the success of BORT in terms of faster pre-training and efficient fine-tuning would not have been possible without the existence of a highly optimised BERT, namely the RoBERTa architecture.
The code for BORT has been open-sourced and can be found on GitHub.
Read the paper here.