GPT-3 marked a major step forward in AGI research, and several large language models have been developed since its launch. So much so that large language models have become one of the key innovations of the decade. However, despite the number of popular models that have been created, building these language models is far from easy. In a recent talk at the NVIDIA GTC conference, Meriem Bendris and Miguel Martinez, Senior Deep Learning Data Scientists at NVIDIA, discussed the various aspects of building a large language model from scratch.
Transformer-based models
While the past three years have seen rapid growth in transformer-based models, they have also seen a steep increase in model complexity. For instance, GPT-3's 175 billion parameters require a memory footprint of almost three terabytes. This leads to three major challenges:
- Efficiently collecting and storing the training data
- Creating algorithms and hardware adapted to distributed training on several machines
- Efficiently deploying the model in real applications
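The figure of almost three terabytes is consistent with a back-of-the-envelope estimate of training state. The sketch below assumes mixed-precision training with the Adam optimiser at roughly 16 bytes per parameter; that accounting is an assumption for illustration, not something stated in the talk, and it ignores activations.

```python
# Rough memory estimate for a dense model.
# Assumption (not from the talk): mixed-precision Adam, about 16 bytes
# per parameter: fp16 weights (2) + fp16 gradients (2) + fp32 master
# weights, momentum and variance (4 + 4 + 4). Activations are ignored.

BYTES_PER_PARAM_TRAINING = 2 + 2 + 4 + 4 + 4  # = 16
BYTES_PER_PARAM_FP16_WEIGHTS = 2


def memory_terabytes(num_params: int, bytes_per_param: int) -> float:
    """Return the footprint in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * bytes_per_param / 1e12


gpt3_params = 175_000_000_000
training_tb = memory_terabytes(gpt3_params, BYTES_PER_PARAM_TRAINING)
weights_tb = memory_terabytes(gpt3_params, BYTES_PER_PARAM_FP16_WEIGHTS)

print(f"training state: {training_tb:.2f} TB")    # 2.80 TB
print(f"fp16 weights only: {weights_tb:.2f} TB")  # 0.35 TB
```

Under these assumptions the training state alone far exceeds the memory of any single accelerator, which is why distribution across machines appears in the list of challenges.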
Miguel spoke about the scalable and predictable nature of deep learning: in theory, any arbitrary level of performance can be reached given enough data. But as model size increases, it becomes harder to control the model loss, and training is usually very expensive. NLP in particular benefits from more parameters, larger datasets and longer training time. Today, beyond raw performance, language models can also solve NLP problems in zero-shot mode, where the model is simply queried with a prompt.
Larger language models are better able to take advantage of zero-shot learning, explained Meriem. Additionally, prompt tuning can achieve stronger performance. Together, these lead to a fundamental shift in how language models are developed for NLP.
The new NLP approach is to develop large-scale language models trained on multiple text resources and then, with little effort, query the model with a prompt to get the answers. The team demonstrated this with an animated version of Jensen that answered questions it had not been trained to answer; the training approach allowed it to do so.
A recurring problem is the inability to use NLP technologies in many local languages. Most large language models work well only when the tasks are in English. Miguel discussed how businesses could leverage NLP for local-language tasks such as chatbots, machine translation, document classification and automatic transcription. In the talk, the researchers detailed how they created two language models, for French and Spanish, and explained their process.
Data preparation: Data preparation is a crucial step since the quality of the data largely determines the quality of the model. The team used a multilingual corpus for the French and Spanish datasets. Data crawled from the internet contains many similar documents, so it is important to filter them out and ensure a fair evaluation. The four main steps are data de-duplication, language model filtering, general cleaning and blacklist modelling. “We started from 138 gigabytes for French and 151 gigabytes for Spanish. The entire data cleaning process took 12 hours, resulting in 170 gigabytes of cleaned data and 149 gigabytes of Spanish,” Meriem explained. Once the data has been cleaned, a tokeniser is trained on it; the team used a BPE tokeniser. Finally, the training library requires the data to be converted into a memory-mapped (MMAP) format for efficient data loading.
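To make the cleaning steps more concrete, here is a minimal sketch of general cleaning, blacklist filtering and exact de-duplication. It is not the team's pipeline: the `BLACKLIST` phrases, the function names and the choice of SHA-256 hashing are all hypothetical, language model filtering is left out, and only documents that are identical after normalisation are caught (near-duplicates would need a fuzzy method such as MinHash).

```python
import hashlib
import re
import unicodedata

# Hypothetical blacklist; a real pipeline would load a curated list.
BLACKLIST = {"lorem ipsum", "click here to subscribe"}


def normalise(text: str) -> str:
    """General cleaning: Unicode-normalise, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def is_blacklisted(normalised_text: str) -> bool:
    """Return True if any blacklisted phrase occurs in the document."""
    return any(phrase in normalised_text for phrase in BLACKLIST)


def clean_corpus(documents):
    """Drop empty, blacklisted and duplicate documents.

    Duplicates are detected by hashing the normalised text; the first
    occurrence of each document is kept, in the original order.
    """
    seen_hashes = set()
    kept = []
    for doc in documents:
        norm = normalise(doc)
        if not norm or is_blacklisted(norm):
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept


corpus = [
    "Bonjour le monde.",
    "bonjour   le monde.",           # duplicate after normalisation
    "Hola mundo.",
    "Click here to subscribe now!",  # contains a blacklisted phrase
    "",                              # empty document
]
cleaned = clean_corpus(corpus)
print(cleaned)  # ['Bonjour le monde.', 'Hola mundo.']
```

Hashing the normalised text instead of storing it keeps the memory used by the `seen_hashes` set bounded per document, which matters at the scale of hundreds of gigabytes.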
Pre-training the model: It is important to distribute both the data and the model efficiently. Two popular distribution techniques are data parallelism and model parallelism. In data parallelism, the data is split across multiple machines, which speeds up training. Pipeline and tensor model parallelism are also used to split the model itself. Hyperparameters need to be optimised for the target architecture, Miguel explained. Once the perplexity of the trained model has been computed, the next step is to generate text and validate the quality of the model manually. The team used a REST server for this. The analysis showed that the model could generate grammatically and semantically correct text and contextualise the generation with the input.
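Perplexity, the metric mentioned above, is the exponential of the average negative log-probability the model assigns to each token of a held-out text; lower is better. A minimal sketch follows, where the per-token probabilities are made-up values for illustration and not results from the talk.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability of the observed tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four held-out tokens.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity([math.log(p) for p in probs])
print(round(ppl, 6))  # 4.0
```

A perplexity of 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens at each step.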
Downstream tasks: To apply a language model to a specific NLP task, it can be trained further on labelled data for that task. Encoder-based language models fine-tune a classifier layer on top of the language model, while decoder-based language models are trained further on next-token generation using labelled examples. Large models that perform well in zero-shot mode can instead be queried by formulating a description of the NLP task and letting the model generate the answer. NLP tasks can be described efficiently by manually defining a robust template for each task. Prompt engineering and prompt tuning are used to find the best prompt template using the data. Prompt tuning adds features to the input text that are fine-tuned for specific tasks. The language model itself remains the same across NLP tasks, while specific layers are activated differently depending on the task.
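As an illustration of a manually defined template, the sketch below builds a zero-shot prompt for sentiment classification and maps the model's free-text continuation back to a label. The template wording, the label set and both function names are hypothetical, and no model is called; in practice the prompt string would be sent to the language model and its continuation passed to `parse_answer`.

```python
from typing import List, Optional

# Hypothetical zero-shot template; the wording is not from the talk.
TEMPLATE = (
    "Task: classify the sentiment of the review as one of: {labels}.\n"
    "Review: {text}\n"
    "Sentiment:"
)


def build_prompt(text: str, labels: List[str]) -> str:
    """Fill the task-description template for one input example."""
    return TEMPLATE.format(labels=", ".join(labels), text=text.strip())


def parse_answer(generated: str, labels: List[str]) -> Optional[str]:
    """Map the first word of the model's continuation to a known label."""
    words = generated.strip().split()
    if not words:
        return None
    first_word = words[0].lower().strip(".,!")
    for label in labels:
        if label.lower() == first_word:
            return label
    return None


labels = ["positive", "negative"]
prompt = build_prompt("Le film était excellent.", labels)
print(prompt)
print(parse_answer(" Positive.", labels))  # positive
```

Prompt engineering would vary the wording of `TEMPLATE` by hand, whereas prompt tuning would learn additional input features from data while leaving the language model's weights unchanged.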
Product deployment: The last step is efficiently deploying the model into production. Large models can have a heavy memory footprint, which makes them difficult to deploy. Pruning the model can cut its size in half while maintaining the same accuracy. To achieve real-time responses, a quantisation process is used to reduce numerical precision.
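The talk summary does not say which pruning or quantisation schemes were used, so the sketch below picks two common ones as assumptions: unstructured magnitude pruning at 50% sparsity and symmetric per-tensor int8 quantisation. It works on a plain list of weights and says nothing about accuracy, which would have to be measured on the real model.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantise_int8(weights):
    """Symmetric per-tensor quantisation to integers in [-127, 127].

    Returns (q, scale) such that each weight is approximately scale * q.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate floating-point weights from int8 values."""
    return [scale * v for v in q]


weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.03]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantise_int8(pruned)
restored = dequantise(q, scale)

print(pruned)  # [0.0, -1.27, 0.5, 0.0, 0.9, 0.0]
print(q)       # [0, -127, 50, 0, 90, 0]
```

Each quantised weight fits in one byte instead of two or four, and the rounding error per weight is at most half of `scale`.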