Russian company Yandex recently open-sourced YaLM 100B, a bilingual neural network for generating and processing text.
“By making YaLM 100B publicly available, we hope to give impetus to further developing generative neural networks,” said Petr Popov, CEO of Yandex Technologies.
The development comes at a time when several big companies like Meta, Google, and OpenAI have open-sourced some of their large transformer-based models. In early 2021, researchers at Google Brain open-sourced the Switch Transformer, a natural-language processing (NLP) AI model. EleutherAI open-sourced its large language model (LLM) GPT-NeoX-20B in April 2022, followed by Meta AI open-sourcing the first version of OPT-175B.
What is YaLM 100B?
YaLM 100B is a GPT-like neural network for generating and processing text. It is the largest language model in the YaLM family. YaLM language models learn the principles by which texts are constructed and generate new texts based on the rules of linguistics and their knowledge of the world. Besides creating texts, YaLM can classify them by style of speech.
Yandex has been using YaLM neural networks in its voice assistant, Alice, and in its search engine, Yandex Search.
YaLM 100B has been released under the Apache 2.0 license, which permits research and commercial use.
Training the model
Training large-scale language models is resource-intensive. “Training generative neural networks requires substantial resources, experienced professionals and years of work. And it is important for us that not only the largest IT companies have access to modern technologies, but the entire community of researchers and developers,” said Popov.
Developers at Yandex trained YaLM 100B on a cluster of 800 A100 graphics cards for 65 days. During training, the neural network consumed 300B tokens and processed 1.7TB of text in English and Russian. The training data consists of roughly 25% text from the Pile (an open English dataset by the EleutherAI team) and 75% Russian-language text from sources such as Wikipedia, preprocessed dialogues from social media, the Taiga dataset, the Russian Distributional Thesaurus dataset and the Yandex Search index.
Developers used DeepSpeed, a deep learning optimization library, to train the model. DeepSpeed makes distributed training and inference easy, efficient, and effective.
The researchers explained how they trained the model and suggested ways to accelerate model training. According to them, a 10% increase in training speed can reduce runtime on a high-value cluster by a week.
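The figures quoted above can be checked with some quick arithmetic. The sketch below is a back-of-the-envelope calculation, not something published by Yandex: it applies the 10% speed-up to the 65-day YaLM 100B run and assumes that "10% faster" means throughput is multiplied by 1.1.

```python
# Figures quoted in the article
NUM_GPUS = 800          # A100 graphics cards
TRAINING_DAYS = 65      # wall-clock training time
TOKENS = 300e9          # tokens consumed during training

# Assumption: a "10% increase in training speed" multiplies throughput by 1.1
SPEEDUP = 1.10

gpu_days = NUM_GPUS * TRAINING_DAYS
tokens_per_second = TOKENS / (TRAINING_DAYS * 24 * 60 * 60)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

# The same amount of work at 1.1x throughput takes 1/1.1 of the time
days_saved = TRAINING_DAYS - TRAINING_DAYS / SPEEDUP

print(f"GPU-days used:           {gpu_days:,}")
print(f"Cluster tokens/second:   {tokens_per_second:,.0f}")
print(f"Tokens/second per GPU:   {tokens_per_second_per_gpu:.1f}")
print(f"Days saved at 10% gain:  {days_saved:.1f}")
```

Under these assumptions the saving comes to just under six days, which is in line with the "about a week" figure for a run of this length.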
Training iterations usually include the following steps:
- Preparing the batch
- Calculating the activations and the loss by running forward propagation
- Calculating gradients by running backward propagation
- Running the step stage to modify the model’s weights
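The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not Yandex's training code: the model is a hypothetical one-variable linear regression fitted to y = 3x + 1, gradients are written out by hand, and the optimizer is plain SGD. All function names are made up for the example.

```python
import random

def make_batch(rng, batch_size):
    """Step 1: prepare a batch of (x, y) pairs drawn from y = 3x + 1."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
    return [(x, 3.0 * x + 1.0) for x in xs]

def forward(w, b, batch):
    """Step 2: forward propagation, returning predictions and the MSE loss."""
    preds = [w * x + b for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    return preds, loss

def backward(preds, batch):
    """Step 3: backward propagation, returning d(loss)/dw and d(loss)/db."""
    n = len(batch)
    grad_w = sum(2.0 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    grad_b = sum(2.0 * (p - y) for p, (_, y) in zip(preds, batch)) / n
    return grad_w, grad_b

def step(w, b, grad_w, grad_b, lr):
    """Step 4: update the weights (plain SGD)."""
    return w - lr * grad_w, b - lr * grad_b

def train(iterations=200, batch_size=32, lr=0.1, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    losses = []
    for _ in range(iterations):
        batch = make_batch(rng, batch_size)
        preds, loss = forward(w, b, batch)
        grad_w, grad_b = backward(preds, batch)
        w, b = step(w, b, grad_w, grad_b, lr)
        losses.append(loss)
    return w, b, losses

if __name__ == "__main__":
    w, b, losses = train()
    print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

In a real framework each of these steps is a separate phase with its own cost, which is why speeding up any one of them shortens every iteration.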
Accelerating model training
To accelerate model training, the developers suggest the following:
- Looking for bottlenecks: The team recommends using a profiler to identify performance bottlenecks in the model, since a profiler shows how the training time is actually spent. For example, the researchers found that a single operation took almost 50% of the entire training time. They then reduced the token embedding size to avoid an excessive matrix multiplication at the end of the network, which sped up training.
- Using fast data types: The data types used to store the model and perform calculations determine the speed of training and inference, so the developers recommend using fast ones. For example, on A100 and newer graphics cards, 16-bit data types like fp16 and bfloat16 (bf16) are five times faster than fp32 (single-precision format) and 2.5 times faster than the 19-bit tf32 (TensorFloat format). However, older graphics cards do not support the bf16 and tf32 data types, and on them fp16 is only two times faster than fp32.
- Accelerating GPU operations: You can fully utilize GPUs by increasing the batch size, which speeds up training. To minimize memory interaction, the developers suggest fusing kernels using torch.jit.script, writing your own CUDA kernels, or using the ready-made CUDA kernels available in the Megatron-LM and DeepSpeed libraries. For example, using torch.jit.script, the developers fused three operations (a tensor add, a dropout and another tensor add), which increased training speed by 5%. For YaLM, the developers used different kinds of fused kernels that sped up training by almost 1.5 times. They also advise disabling dropout if you have a lot of data and see no overfitting at dropout == 0; this increased their computing speed by 15%.
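To illustrate the first recommendation in the list above, here is a minimal profiling sketch using Python's built-in cProfile and pstats modules. The functions cheap_op, heavy_op and train_step are hypothetical stand-ins for parts of a training step, and the profiler measures CPU time only; profiling real GPU training would need a GPU-aware profiler instead.

```python
import cProfile
import pstats

def cheap_op():
    """Hypothetical stand-in for an inexpensive part of a training step."""
    total = 0
    for i in range(1_000):
        total += i
    return total

def heavy_op():
    """Hypothetical stand-in for the bottleneck operation."""
    total = 0
    for i in range(300_000):
        total += i * i
    return total

def train_step():
    return cheap_op() + heavy_op()

def find_hottest_function(steps=3):
    """Profile a few steps and return (function name, share of own time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(steps):
        train_step()
    profiler.disable()

    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) ->
    # (primitive calls, total calls, own time, cumulative time, callers)
    own_time = {}
    for (_file, _line, name), (_cc, _nc, tottime, _ct, _callers) in stats.stats.items():
        own_time[name] = own_time.get(name, 0.0) + tottime

    hottest = max(own_time, key=own_time.get)
    share = own_time[hottest] / sum(own_time.values())
    return hottest, share

if __name__ == "__main__":
    name, share = find_hottest_function()
    print(f"{name} accounts for {share:.0%} of profiled time")
```

Once a single function dominates the profile like this, it becomes the obvious candidate for optimization, much as the oversized final matrix multiplication was for the YaLM team.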
The NVIDIA NCCL library helped ensure maximum communication speed by allowing GPUs to communicate effectively over the network without any CPU intermediaries. Using the Zero Redundancy Optimizer (ZeRO) accelerated communication even more.
Though ZeRO saved huge amounts of memory, it added complexity by introducing new heavy operations. To overcome this, the developers gathered the different layers asynchronously, one after the other. This technique sped up the training of their models by 80%.
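The idea of gathering layers asynchronously can be illustrated with a small simulation. The sketch below is conceptual and does not use DeepSpeed or NCCL: gather_layer and compute_layer are hypothetical functions whose costs are faked with time.sleep, and a background thread stands in for the network. It shows only the scheduling pattern, in which the gather for the next layer starts while the current layer is still computing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

GATHER_SECONDS = 0.02   # assumed cost of collecting one layer's weights
COMPUTE_SECONDS = 0.02  # assumed cost of running one layer

def gather_layer(index):
    """Hypothetical stand-in for collecting a layer's sharded weights."""
    time.sleep(GATHER_SECONDS)
    return float(index + 1)  # a single fake "weight"

def compute_layer(x, weight):
    """Hypothetical stand-in for the layer's forward computation."""
    time.sleep(COMPUTE_SECONDS)
    return x * weight

def forward_sequential(x, num_layers):
    """Gather, then compute, one layer at a time (no overlap)."""
    for i in range(num_layers):
        weight = gather_layer(i)
        x = compute_layer(x, weight)
    return x

def forward_prefetch(x, num_layers):
    """Start gathering layer i+1 before computing layer i."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(gather_layer, 0)
        for i in range(num_layers):
            weight = pending.result()          # wait for this layer's weights
            if i + 1 < num_layers:
                pending = pool.submit(gather_layer, i + 1)  # prefetch the next
            x = compute_layer(x, weight)       # overlaps with the prefetch
    return x

if __name__ == "__main__":
    for fn in (forward_sequential, forward_prefetch):
        start = time.perf_counter()
        result = fn(1.0, 6)
        print(f"{fn.__name__}: result={result}, {time.perf_counter() - start:.2f}s")
```

Both versions produce the same result, but in the prefetching version the gather cost is hidden behind computation for every layer except the first.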
Divergence and stabilization strategies
The model was prone to divergence. When divergence occurs, the training loss starts to grow instead of falling, and the model gradually forgets what it has learnt. To deal with this, the developers deployed the following stabilization strategies:
- Adopted bf16 as the main data type for weights.
- Ran precision-critical computations in tf32.
- Introduced Pre-LayerNorm and added a LayerNorm after the embeddings.
- Used Curriculum Learning, a training strategy that trains a machine learning model from easier data to harder data. It helps in improving the generalization capacity and convergence rate of various models.
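The first two strategies in the list above both concern numeric formats. The sketch below is an illustration using only Python's struct module, not code from the YaLM project. It compares fp16 (5 exponent bits, 10 mantissa bits) with bf16 (8 exponent bits, 7 mantissa bits). The bf16 rounding is emulated by keeping the top 16 bits of the fp32 encoding, and the helper names are hypothetical; NaN and infinity inputs are not handled.

```python
import struct

def to_fp16(x):
    """Round x to IEEE half precision; values past the fp16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # x is beyond fp16's largest finite value (65504)
        return float("inf")

def to_bf16(x):
    """Round x to bfloat16 by keeping the top 16 bits of its fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (70000.0, 1.001):
        print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

A large value such as 70000 overflows in fp16 but stays finite in bf16, while a value close to 1 is represented more precisely in fp16. That trade-off of wider range for less precision is one reason to keep weights in bf16 while running precision-critical computations in a format with more mantissa bits.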