Last updated February 22, 2024

Neural Magic open sources a pruned version of BERT language model

US-based Neural Magic, in collaboration with Intel Corporation, has come up with its own 'pruned' version of BERT-Large that is eight times faster and 12 times smaller in storage size.


Published on July 21, 2022

by Kartik Wali


In a reverse trend of sorts, researchers are now looking for ways to reduce the huge computational cost and size of language models without hampering their accuracy.

Source: neuralmagic.com

In this endeavour, US-based Neural Magic, in collaboration with Intel Corporation, has developed its own ‘pruned’ version of BERT-Large that is eight times faster and 12 times smaller in storage size. To achieve this, the researchers combined pruning and sparsification in the pre-training stage to create general, sparse architectures, which are then fine-tuned and quantised on datasets for standard tasks such as SQuAD for question answering. This method resulted in highly compressed networks with no considerable loss of accuracy compared with the unoptimised models. As part of the research, Intel has released the Prune OFA models on Hugging Face.
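To make the two compression steps concrete, here is a minimal sketch in plain Python. It assumes simple magnitude pruning (zeroing the smallest weights) followed by symmetric 8-bit quantisation; this is a generic illustration and not the Prune OFA training recipe itself. The function names and the toy weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantisation to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Toy weight vector (illustrative values only).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

print(pruned)    # half of the entries are now exactly zero
print(q)         # small integers instead of 32-bit floats
print(restored)  # close to `pruned`, within half a quantisation step
```

In a real model, pruning is applied gradually during training so the remaining weights can compensate, which is why accuracy can stay close to that of the dense network.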

Deployment with DeepSparse

The DeepSparse Engine is specifically engineered to accelerate sparse and sparse-quantized networks. It leverages sparsity to reduce the overall compute and takes advantage of the CPU’s large caches to access memory faster. As a result, GPU-class performance can be achieved on commodity CPUs. Combining DeepSparse with the Prune Once for All (Prune OFA) sparse-quantized models yields 11x better throughput and 8x better performance for latency-based applications, beating BERT-base and achieving DistilBERT-level performance without sacrificing accuracy.
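The idea that sparsity reduces compute can be illustrated with a toy dot product. The sketch below is not how DeepSparse is implemented; it only assumes that a sparse weight row is stored as (index, value) pairs so that multiplications by zero are skipped. All names and numbers are hypothetical.

```python
def to_sparse(row):
    """Keep only (index, value) pairs for the non-zero weights."""
    return [(i, w) for i, w in enumerate(row) if w != 0]


def dense_dot(row, x):
    """Dot product over every weight; returns (result, multiplications)."""
    return sum(w * xi for w, xi in zip(row, x)), len(row)


def sparse_dot(sparse_row, x):
    """Dot product over non-zero weights only; returns (result, multiplications)."""
    return sum(w * x[i] for i, w in sparse_row), len(sparse_row)


# A weight row that is 70% zeros, and an input vector (toy values).
row = [0.5, 0, 0, -2.0, 0, 0, 0, 1.0, 0, 0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

dense_result, dense_mults = dense_dot(row, x)
sparse_result, sparse_mults = sparse_dot(to_sparse(row), x)

print(dense_result, dense_mults)    # same result, 10 multiplications
print(sparse_result, sparse_mults)  # same result, 3 multiplications
```

The sparse version also touches less memory, which is the part that benefits from keeping the remaining weights in the CPU cache.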

Source: neuralmagic.com

The graph above highlights the trade-off between scaling a network's structured size and sparsifying it to remove redundancies. DistilBERT, the fastest of the dense models, has the fewest layers and channels and the lowest accuracy. With more layers and channels, BERT-base is slower but more accurate. Finally, BERT-Large is the most accurate and the largest, but has the slowest inference. Despite its reduced number of parameters, the sparse-quantized BERT-Large is close in accuracy to the dense version and runs inference 8x faster. So, while the larger optimisation space helped during training, not all of these pathways were necessary to maintain accuracy. The redundancies in these larger networks become even more apparent when comparing the file sizes needed to store these models, as shown in the graph below.

Source: neuralmagic.com
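A rough, back-of-the-envelope estimate shows why a sparse, quantised model needs so much less storage. The sketch below assumes a dense model stored as 32-bit floats and a sparse model that stores each surviving weight as one byte plus a two-byte position index. The parameter count, sparsity level and storage layout are illustrative assumptions, not figures reported by Neural Magic or Intel.

```python
def dense_size_bytes(n_params, bytes_per_weight=4):
    """Storage for a dense model with 32-bit float weights."""
    return n_params * bytes_per_weight


def sparse_quantized_size_bytes(n_params, sparsity,
                                bytes_per_weight=1, bytes_per_index=2):
    """Storage when only non-zero 8-bit weights and their indices are kept."""
    nonzero = round(n_params * (1 - sparsity))
    return nonzero * (bytes_per_weight + bytes_per_index)


# Hypothetical layer with one million weights, 90% of them pruned.
n_params = 1_000_000
dense = dense_size_bytes(n_params)
sparse = sparse_quantized_size_bytes(n_params, sparsity=0.9)

print(dense, sparse, dense / sparse)
```

Real file formats add further compression and metadata, so actual ratios depend on the model and the serialisation used.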



Kartik Wali

A writer by passion, Kartik strives to get a deep understanding of AI, Data analytics and its implementation on all walks of life. As a Senior Technology Journalist, Kartik looks forward to writing about the latest technological trends that transform the way of life!