Neural Magic open-sources a pruned version of the BERT language model

US-based Neural Magic, in collaboration with Intel Corporation, has come up with its own 'pruned' version of BERT-Large that is eight times faster and 12 times smaller in size and storage space.


In a reverse trend of sorts, researchers are now looking for ways to reduce the huge computational cost and size of language models without hampering their accuracy. 


In this endeavour, US-based Neural Magic, in collaboration with Intel Corporation, has developed its own ‘pruned’ version of BERT-Large that is eight times faster and 12 times smaller in size and storage space. To achieve this, the researchers combined pruning and sparsification in the pre-training stage to create general, sparse architectures that are then fine-tuned and quantised on datasets for standard tasks such as SQuAD for question answering. This method produced highly compressed networks without considerable loss of accuracy relative to the unoptimised models. As part of the research, Intel has released the Prune OFA (Prune Once for All) models on Hugging Face.
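For readers who want to try the released checkpoints, here is a minimal sketch of how such a model could be loaded with Hugging Face's transformers library. The model ID below is an illustrative assumption, not a confirmed name; the exact identifiers are listed on Intel's Hugging Face page. The sketch mirrors the recipe described above: start from the general sparse pre-trained encoder and attach a task head for fine-tuning on SQuAD.

# Minimal sketch; the model ID is assumed for illustration.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL_ID = "Intel/bert-large-uncased-sparse-90-unstructured-pruneofa"  # assumed ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Attach a question-answering head to the sparse encoder; the head is
# freshly initialised and would be trained during SQuAD fine-tuning.
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID)
print(f"Loaded sparse BERT with {model.num_parameters():,} parameters")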

Deployment with DeepSparse

The DeepSparse Engine is specifically engineered to accelerate sparse and sparse-quantized networks. It leverages sparsity to reduce the overall compute and takes advantage of the CPU’s large caches to access memory at a faster pace. With this method, GPU-class performance can be achieved on commodity CPUs. Combining DeepSparse with the Prune Once for All sparse-quantized models yields 11x better throughput and 8x better performance for latency-critical applications, beating BERT-base and achieving DistilBERT-level performance without sacrificing accuracy.
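As a sketch of what CPU deployment looks like in practice: Neural Magic's deepsparse Python package (pip install deepsparse) exposes a Pipeline API for serving such models. The SparseZoo stub below is an assumption; the exact stub for the Prune OFA BERT-Large should be looked up on Neural Magic's SparseZoo.

# Minimal sketch; the SparseZoo stub is illustrative, not a confirmed identifier.
from deepsparse import Pipeline

qa = Pipeline.create(
    task="question-answering",
    model_path=(
        "zoo:nlp/question_answering/bert-large/pytorch/huggingface/"
        "squad/pruned90_quant-none"  # assumed stub; check sparsezoo.neuralmagic.com
    ),
)

result = qa(
    question="What accelerates sparse-quantized networks on CPUs?",
    context="The DeepSparse Engine leverages sparsity and the CPU's large "
            "caches to achieve GPU-class performance on commodity CPUs.",
)
print(result.answer)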

[Graph comparing accuracy and inference speed of DistilBERT, BERT-base, BERT-Large and sparse-quantized BERT-Large. Source: neuralmagic.com]

The graph above highlights the trade-off between scaling networks up in structured size versus sparsifying them to remove redundancies. The fastest of the group, DistilBERT, has the fewest layers and channels and the lowest accuracy. With more layers and channels added, BERT-base is slower but more accurate. Finally, BERT-Large is the most accurate and the largest, but the slowest at inference. Despite its reduced number of parameters, the sparse-quantized BERT-Large is close in accuracy to the dense version and runs inference 8x faster. So, while the larger optimisation space helped during training, not all of these pathways were necessary to maintain accuracy. The redundancies in these larger networks surface even more when comparing the file sizes needed to store the models, as shown in the graph below.

[Graph comparing the file sizes needed to store each model. Source: neuralmagic.com]
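To see the compression effect directly, the short sketch below (assuming an ordinary PyTorch model object, e.g. one loaded with transformers; the helper name is hypothetical) counts exact-zero weights and compares raw versus gzip-compressed checkpoint sizes. Zeroed weights are what make the pruned files so much smaller to store.

# Illustrative sketch only; none of these names come from Neural Magic's tooling.
import gzip
import os
import shutil

import torch

def sparsity_report(model: torch.nn.Module, path: str = "model.pt") -> None:
    total = zeros = 0
    for p in model.parameters():
        total += p.numel()
        zeros += int((p == 0).sum())
    print(f"{zeros / total:.1%} of {total:,} weights are exactly zero")

    # Zeroed weights compress extremely well with ordinary gzip.
    torch.save(model.state_dict(), path)
    with open(path, "rb") as raw, gzip.open(path + ".gz", "wb") as gz:
        shutil.copyfileobj(raw, gz)
    print(f"raw: {os.path.getsize(path) / 1e6:.1f} MB, "
          f"gzipped: {os.path.getsize(path + '.gz') / 1e6:.1f} MB")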

For more information, click here


Kartik Wali

A writer by passion, Kartik strives to gain a deep understanding of AI and data analytics and their implementation in all walks of life. As a Senior Technology Journalist, Kartik looks forward to writing about the latest technological trends that transform our way of life.