
GPT-2 Vs Chinese Language Model: How Was The Latter Trained


In a recent development, Chinese researchers have created a gigantic language model comparable to GPT-2 in scale. The model, developed by researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence, has around 2.6 billion parameters and was trained on 100GB of Chinese data.

To set the context, GPT-2 has around 1.5 billion parameters. The Chinese Pre-trained Language Model, or CPM, as the model is called, comes in different sizes, with capabilities increasing as the model grows. The researchers claim it is the largest Chinese pre-trained language model and that it can perform a wide range of NLP tasks. While 100GB of data is relatively small compared to GPT-3's 570GB of training data, the results have been quite satisfactory, and the researchers aim to scale further going forward.

The Tech Behind CPM & How It Differs From GPT Models

While GPT-3 was a massive achievement for researchers and has found applications in various fields, applying it to Chinese NLP tasks remained a big challenge. This is because the training corpus of GPT-3 is primarily English (almost 93%) and its parameters are not publicly available.

To overcome this primary challenge, the researchers trained on Chinese data to make the model more relevant for Chinese tasks. With their continued efforts, they have succeeded in facilitating several Chinese NLP tasks, such as conversation, language understanding, and essay generation. In fact, CPM achieves strong performance in few-shot and even zero-shot settings.

When compared with GPT-3 on tasks such as question answering, summarisation, conversation, basic arithmetic, and writing, CPM's performance is quite remarkable.

The Chinese researchers built on previous work on Chinese pre-trained language models by expanding the Chinese vocabulary and redesigning the training strategy. They constructed a new sub-word vocabulary and increased the training batch size to 3,072 for more stable model training.

Explaining their work on vocabulary construction, the researchers said that previous Chinese pre-trained models usually adopt the sub-word vocabulary of BERT-Chinese, which splits the input text into a character-level sequence. However, Chinese words usually contain several characters, and some important semantic meaning is lost in a character-level sequence. “To solve this problem, we construct a new sub-word vocabulary, containing both words and characters,” the researchers noted.
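To make the idea concrete, the sketch below shows one common way a mixed word-and-character vocabulary can be used for segmentation: try the longest matching word first and fall back to single characters only when no word matches. This is a minimal illustration, not the CPM tokenizer itself, and the tiny vocabulary in it is purely hypothetical.

```python
# Minimal sketch (not the CPM implementation) of segmentation over a vocabulary
# that mixes multi-character words with single characters. Word-level entries
# keep word semantics where possible; unmatched spans fall back to characters,
# similar to a character-level (BERT-Chinese style) split.

ILLUSTRATIVE_VOCAB = {"机器学习", "语言", "模型", "学习"}  # hypothetical word entries
MAX_WORD_LEN = max(len(w) for w in ILLUSTRATIVE_VOCAB)

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation over words, with character fallback."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate word starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in ILLUSTRATIVE_VOCAB:
                tokens.append(candidate)
                i += length
                break
        else:
            # No word matched: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("机器学习语言模型"))  # ['机器学习', '语言', '模型']
```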

In terms of training strategy, they adopted a large batch size to make model training more stable. At 3 million tokens, their batch size is three times that of GPT-3, which used 1 million tokens. Further, the researchers noted that the largest model cannot be stored on a single GPU during training, so they partitioned the model across GPUs along the width dimension to make large-scale training feasible and to reduce data transfer among nodes.
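The sketch below illustrates the general idea of partitioning a layer “along the width dimension”: each device holds only a slice of a weight matrix’s output columns, computes its partial result locally, and the slices are concatenated. This is a simplified numpy simulation of that concept under assumed dimensions, not the researchers’ actual distributed training code.

```python
# Minimal numpy sketch of width-wise (column) model partitioning: each simulated
# GPU stores only a slice of the weight matrix, computes its partial output, and
# the partial outputs are concatenated, so the full matrix never needs to live
# on one device. Dimensions and device count below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_devices = 4                      # assumed number of model-parallel partitions
d_in, d_out = 512, 2048            # illustrative layer dimensions

x = rng.standard_normal((8, d_in))          # a batch of activations
W = rng.standard_normal((d_in, d_out))      # full weight matrix (for reference only)

# Reference: the un-partitioned computation on a single device.
y_full = x @ W

# Width-wise partition: each device stores d_out / n_devices output columns.
shards = np.split(W, n_devices, axis=1)
partial_outputs = [x @ shard for shard in shards]   # computed independently per device
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_full, y_parallel)
print("partitioned output matches the single-device result:", y_parallel.shape)
```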

For pre-training, the researchers collected different kinds of text from various sources, such as encyclopedias, news, novels, and Q&A.

Wrapping Up

While the researchers have achieved satisfactory results so far, they plan to further explore the power of large-scale pre-trained models by adding more diverse training data and increasing the model size. They also plan to optimise the training framework, such as the data-transfer scheme between nodes, to further accelerate training. “For text data, we will add a multilingual corpus to train a large-scale Chinese-centered multilingual language model,” they said.

While CPM is currently used only for technical and scientific purposes, some experts have pointed out that, unlike the work around GPT-3, the CPM research does not yet examine the model’s biases. That said, the research has come out barely six months after the GPT-3 paper was published, which is quite remarkable, and the team aims to achieve greater efficiency going forward.

Read the complete paper here. 


Srishti Deoras

Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.