GPT-2 Vs Chinese Language Model: How Was The Latter Trained

In a recent development, Chinese researchers have created a very large language model that is comparable to GPT-2 in parameter count. The model, developed by researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence, has around 2.6 billion parameters and was trained on 100GB of Chinese data.

To set the context, GPT-2 has around 1.5 billion parameters. The Chinese Pre-trained Language Model, or CPM, as the model is called, comes in different sizes, and its capabilities increase with model size. The researchers claim that it is the largest Chinese pre-trained language model and that it can perform a wide range of NLP tasks. While 100GB of data is relatively small compared to GPT-3's 570GB of training data, the results have been quite satisfactory, and the researchers aim to scale the model further.

The Tech Behind CPM & How It Differs From GPT Models

While GPT-3 was a massive achievement for researchers and has been applied in various fields, using it for Chinese NLP tasks has remained a big challenge. This is because the training corpus of GPT-3 is primarily English (almost 93%) and its parameters are not publicly available.


To overcome this challenge, the researchers trained their model on Chinese data to make it more relevant to Chinese-language tasks. The resulting model handles several Chinese NLP tasks, including conversation, language understanding, and essay generation. The researchers report that CPM achieves strong performance in few-shot and even zero-shot settings.

Set against GPT-3 on tasks such as question answering, summarization, conversation, basic arithmetic, and writing, CPM's performance is quite remarkable.

The Chinese researchers built on previous work on Chinese pre-trained language models by expanding the Chinese vocabulary and redesigning the training strategy. They built a new sub-word vocabulary and set the training batch size to 3,072 for more stable model training.
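As a quick sanity check on the batch-size figure, the arithmetic can be sketched as follows. The sequence length of 1,024 tokens is an assumption made for illustration and is not stated in this article; under that assumption, a batch of 3,072 sequences lines up with the figure of about 3 million tokens per batch reported for CPM.

```python
def tokens_per_batch(num_sequences: int, seq_len: int) -> int:
    """Total tokens processed in one training step."""
    return num_sequences * seq_len

BATCH_SEQUENCES = 3_072   # batch size reported for CPM
ASSUMED_SEQ_LEN = 1_024   # assumption: tokens per sequence, not given in this article

total = tokens_per_batch(BATCH_SEQUENCES, ASSUMED_SEQ_LEN)
print(f"{total:,} tokens per batch (about {total / 1e6:.1f} million)")
```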

Explaining their work on vocabulary construction, the researchers said that previous Chinese pre-trained models usually adopted the sub-word vocabulary of BERT-Chinese, which splits the input text into a character-level sequence. However, Chinese words usually contain several characters, and some of a word's important semantic meaning is lost in a character-level sequence. “To solve this problem, we construct a new sub-word vocabulary, containing both words and characters,” noted researchers.
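To see why a vocabulary that mixes words and characters helps, consider a toy sketch. This is not the researchers' tokenizer: the vocabulary `TOY_VOCAB` and the greedy longest-match rule in `segment` are hypothetical stand-ins chosen only to contrast a character-level split with a mixed word-and-character split.

```python
def segment(text, vocab, max_len=4):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so nothing is ever dropped.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy vocabulary of multi-character words and single characters.
TOY_VOCAB = {"清华", "大学", "语言", "模型", "的"}

sentence = "清华大学的语言模型"   # "Tsinghua University's language model"
char_level = list(sentence)       # character-level split, as described for BERT-Chinese
mixed = segment(sentence, TOY_VOCAB)

print(char_level)  # 9 single-character tokens
print(mixed)       # 5 tokens, with whole words kept together
```

In the mixed split, words such as 大学 (university) stay intact as single tokens, while any character missing from the vocabulary would still be emitted on its own.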

In terms of training strategy, they adopted a large batch size to make model training more stable. Compared with the batch size of 1 million tokens used for GPT-3, their batch of about 3 million tokens is roughly three times the size. Further, the researchers noted that the largest model cannot be stored on a single GPU during training, so they partitioned it across GPUs along the width dimension to make large-scale training feasible and reduce data transfer among nodes.
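The idea of partitioning along the width dimension can be illustrated with a minimal sketch. The helper names below are hypothetical and the "devices" are simulated with plain Python lists; the point is only that splitting a weight matrix into column blocks lets each device compute its own slice of the output, and that joining the slices reproduces the unpartitioned result. The researchers' actual training framework is not described in this article.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split a weight matrix into equal column blocks, one per simulated device."""
    cols = len(w[0])
    assert cols % parts == 0, "width must divide evenly across devices"
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]

def sharded_matmul(x, w, parts):
    """Compute x @ w by giving each simulated device one column block of w."""
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]  # each would run on its own GPU
    # Join the per-device output slices back into full-width rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]              # a tiny batch of activations
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # a weight matrix of width 4

full = matmul(x, w)
sharded = sharded_matmul(x, w, parts=2)
print(full == sharded)
```

Each simulated device holds only half of the weight columns, which is what makes a model too large for one GPU fit across several.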

For pre-training, the researchers collected different kinds of text from various sources, such as encyclopedias, news, novels, and Q&A.

Wrapping Up

While the researchers have achieved satisfactory results so far, they plan to further explore the power of large-scale pre-trained models by adding more diverse training data and increasing the model size. They also plan to optimise the training framework, such as the data-transfer scheme between different nodes, to accelerate the training process further. “For text data, we will add a multilingual corpus to train a large-scale Chinese-centered multilingual language model,” they said.

While CPM is currently used only for technical and scientific purposes, some experts note that, unlike the GPT-3 work, the CPM research does not currently address the model's biases. That said, the research came out barely six months after the GPT-3 paper was published, which is quite remarkable, and the team aims to achieve greater efficiency in the future.


Srishti Deoras
Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.
