How to Train Your LLM (Efficiently)

Companies that released LLMs with hundreds of billions of parameters are now backtracking from the overarching ‘Bigger is Better’ philosophy. Help is being offered to redress the unreasonable costs that come with the tall promises of LLMs.

Last week, open-source AI platform Hugging Face published a blog about the practical difficulties of training large language models. The bigger the model, the more GPUs it needs.

Hugging Face’s BLOOM-176B, released in July this year, needs eight 80GB A100 GPUs, at a cost of around USD 15K each, just to run inference. The post mentioned that fine-tuning the same model would need 72 GPUs, bringing the cumulative cost of a single training run to an astronomical figure. 



Right after training BLOOM-176B, Hugging Face started looking for ways to run the model on fewer GPUs without losing performance. The company worked with the open-source research community BigScience to come up with an approach built around Int8 inference. 

Model Quantization by Hugging Face 

The blog was released along with a research paper titled ‘LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale’, which introduced a new Int8 matrix multiplication method for the feed-forward and attention projection layers in transformers. Testing showed that the procedure cut the memory needed for inference in half while maintaining the model’s predictive performance. First, the highly systematic emergent outlier features in LLMs that are responsible for attention and transformer predictive performance had to be understood and worked around. Then, a 175B-parameter model saved as a 16/32-bit checkpoint could be loaded, converted to Int8, and used immediately. The team called the new two-part quantization procedure LLM.int8(). 
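The two-part idea can be illustrated with a minimal NumPy sketch: columns of the activation matrix that contain outliers stay in floating point, while everything else is multiplied in int8 and dequantized. This is a toy illustration under stated assumptions, not Hugging Face’s implementation; the function name and threshold value are illustrative.

```python
import numpy as np

def llm_int8_matmul(X, W, threshold=6.0):
    """Sketch of a mixed-precision matmul in the spirit of LLM.int8().

    Columns of X containing an outlier (|x| > threshold) are multiplied
    in full floating-point precision; the remaining dimensions are
    quantized to int8 (row-wise absmax for X, column-wise for W),
    multiplied as integers, and dequantized.
    """
    outlier_cols = np.any(np.abs(X) > threshold, axis=0)

    # Outlier part: ordinary float matmul over the few outlier dimensions.
    out_fp = X[:, outlier_cols] @ W[outlier_cols, :]

    # Regular part: absmax scaling maps each row/column into [-127, 127].
    Xr, Wr = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = 127.0 / np.maximum(np.max(np.abs(Xr), axis=1, keepdims=True), 1e-8)
    sw = 127.0 / np.maximum(np.max(np.abs(Wr), axis=0, keepdims=True), 1e-8)
    Xq = np.round(Xr * sx).astype(np.int8)
    Wq = np.round(Wr * sw).astype(np.int8)

    # Integer matmul in int32 to avoid overflow, then dequantize.
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) / (sx * sw)
    return out_fp + out_int8
```

Keeping the outlier dimensions in floating point is the key design choice: a single large value would otherwise dominate the absmax scale and crush the precision of every other element in that row.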

The primary goal behind the research, Hugging Face said, was to make LLMs that previously did not fit into GPU memory more accessible. This opens up research and democratises LLMs to a great extent, something that limited GPU memory had prevented, especially for researchers and firms with scant resources. The paper also admitted that its findings could, ironically, widen the disparity between smaller, needier organisations and a giant like Google: organisations flush with money can now train even more models using the same number of GPUs (which is already more than smaller organisations have).

The release of the paper drew praise from many experts, including Andrej Karpathy, the former director of AI and Autopilot Vision at Tesla.

Model quantization has grown popular recently as computing costs rise exponentially. It converts model data from a floating-point representation to a lower-precision one, usually 8-bit integers. Last year, NVIDIA GPUs began employing faster and cheaper 8-bit Tensor Cores to compute convolutions and matrix multiplications, yielding more compute throughput, which especially helped compute-limited layers. 
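The float-to-int8 conversion described above can be sketched in a few lines. This is a minimal example of absmax quantization, the simplest variant; the function names are illustrative.

```python
import numpy as np

def quantize_absmax(x):
    """Quantize a float array to int8 by scaling its largest
    absolute value to 127 (per-tensor absmax quantization)."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 values back to approximate floats."""
    return q.astype(np.float32) / scale

x = np.array([0.1, -0.5, 1.2, -2.0], dtype=np.float32)
q, scale = quantize_absmax(x)   # scale = 127 / 2.0 = 63.5
x_hat = dequantize(q, scale)    # close to x, within 0.5 / scale
```

Each int8 value occupies a quarter of the memory of a float32, which is where the inference-memory savings come from; the price is a rounding error of at most half a quantization step.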

Models with an increasing number of parameters, Source: Hugging Face

Is open and transparent the same as democratic?

In May, Meta released Open Pretrained Transformer (OPT-175B), touting it as the first open-source large language model. Meta’s accompanying blog, ‘Democratizing access to large-scale language models’, presented it as proof that the tech giant was taking steps in the right direction to make LLMs accessible to all. The research paper published along with OPT-175B included both the pretrained model and the code needed to train the LLM. However, it is worth remembering that this openness and transparency, while commendable, is not the same as democratic access. 

The race to build large language models has a definite democratisation problem. Even with the noble intentions of companies like Hugging Face, there is a gaping hole that only becomes harder to fill. The cost and time needed to scale deep learning models and train neural networks with more layers and parameters burns a hole noticeable even in richer organisations’ pockets. According to OpenAI, “since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time”. By 2019, this metric had bloated by a factor of 300,000. 
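A quick back-of-the-envelope check shows that the 300,000x figure is consistent with the 3.4-month doubling time OpenAI quotes:

```python
import math

doubling_months = 3.4
growth_factor = 300_000

# How many doublings produce a 300,000x increase?
doublings = math.log2(growth_factor)      # about 18.2 doublings
months = doublings * doubling_months      # about 62 months, i.e. ~5.2 years
```

Roughly 62 months of doubling every 3.4 months takes you from 2012 to late 2017, which matches the window OpenAI measured before publishing the figure.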

Meta claimed that OPT-175B’s carbon footprint was one-seventh that of GPT-3. While this is a drastic decrease, several experts have estimated GPT-3’s training costs at up to USD 27.6 million, which means it would still cost a few million dollars to train OPT-175B. Luckily, the model comes pretrained, and Meta stated that it would provide the codebase to train and deploy it “using only 16 NVIDIA V100 GPUs”. Even that would cost approximately USD 400,000, a significant sum for an independent researcher or small firm. Meta itself used 992 80GB A100 GPUs, which are considerably faster than the V100, to train the model. 

A couple of weeks ago, Meta AI released another paper, ‘Beyond neural scaling laws: beating power law scaling via data pruning’. This time, the paper recognised how compute- and energy-draining scaling neural networks is, and offered a new data-pruning approach that ranks the order in which training examples should be discarded to reach any desired pruned dataset size. 

The biggest large language models over the years, Source: Hugging Face

Prominent tech organisations have clearly taken a detour from the race to build the biggest LLMs, backtracking from the overarching ‘Bigger is Better’ philosophy and offering help to redress the unreasonable costs that come with the models’ tall promises. This may also come from the realisation that an oligopoly of Google, Meta and Microsoft hinders the overall quality of AI research. 


Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
