
GPT-2 Vs Chinese Language Model: How Was The Latter Trained


In a recent development, Chinese researchers have created a gigantic language model comparable to GPT-2 in scale. The model, developed by researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence, has around 2.6 billion parameters and was trained on 100GB of Chinese data.

To set the context, GPT-2 has around 1.5 billion parameters. The Chinese Pre-trained Language Model, or CPM, as the model is called, comes in different sizes, with capabilities increasing as the model grows. The researchers claim it is the largest Chinese pre-trained language model and that it can perform a wide range of NLP tasks. While 100GB of data is relatively small compared to GPT-3's 570GB of training data, the results have been quite satisfactory, and the researchers aim to scale further going forward.

The Tech Behind CPM & How It Differs From GPT Models

While GPT-3 was a massive achievement for researchers and has found applications in various fields, applying it to Chinese NLP tasks remained a big challenge. This is because the training corpus of GPT-3 is primarily English (almost 93%) and its parameters are not publicly available.

To overcome this primary challenge, the researchers trained on Chinese data to make the model more relevant for Chinese tasks. With their continued efforts, they have succeeded in facilitating several Chinese NLP tasks, such as conversation, language understanding, and essay generation. In fact, CPM achieves strong performance in few-shot and even zero-shot settings.

When compared with GPT-3 on tasks such as question answering, summarisation, conversation, basic arithmetic, and writing, CPM's performance is quite remarkable.

The Chinese researchers built on previous work on Chinese pre-trained language models by expanding the Chinese vocabulary and redesigning the training strategy. They constructed a new sub-word vocabulary and increased the training batch size to 3,072 for more stable model training.

Explaining their work on vocabulary construction, the researchers said that previous Chinese pre-trained models usually adopt the sub-word vocabulary of BERT-Chinese, which splits the input text into a character-level sequence. However, Chinese words usually contain several characters, and some important semantic meaning is lost in a character-level sequence. “To solve this problem, we construct a new sub-word vocabulary, containing both words and characters,” the researchers noted.
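To make the idea concrete, the sketch below shows one common way a mixed word-and-character vocabulary can be used for segmentation: try the longest matching word first and fall back to single characters only when no word matches. This is a minimal illustration, not the CPM tokenizer itself, and the tiny vocabulary in it is purely hypothetical.

```python
# Minimal sketch (not the CPM implementation) of segmentation over a vocabulary
# that mixes multi-character words with single characters. Word-level entries
# keep word semantics where possible; unmatched spans fall back to characters,
# similar to a character-level (BERT-Chinese style) split.

ILLUSTRATIVE_VOCAB = {"机器学习", "语言", "模型", "学习"}  # hypothetical word entries
MAX_WORD_LEN = max(len(w) for w in ILLUSTRATIVE_VOCAB)

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation over words, with character fallback."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate word starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in ILLUSTRATIVE_VOCAB:
                tokens.append(candidate)
                i += length
                break
        else:
            # No word matched: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("机器学习语言模型"))  # ['机器学习', '语言', '模型']
```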

In terms of training strategy, they adopted a large batch size to make model training more stable. At 3 million tokens, their batch size is three times that of GPT-3, which used 1 million tokens. Further, the researchers noted that the largest model cannot be stored on a single GPU during training, so they partitioned the model across GPUs along the width dimension to make large-scale training feasible and to reduce data transfer among nodes.
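The sketch below illustrates the general idea of partitioning a layer “along the width dimension”: each device holds only a slice of a weight matrix’s output columns, computes its partial result locally, and the slices are concatenated. This is a simplified numpy simulation of that concept under assumed dimensions, not the researchers’ actual distributed training code.

```python
# Minimal numpy sketch of width-wise (column) model partitioning: each simulated
# GPU stores only a slice of the weight matrix, computes its partial output, and
# the partial outputs are concatenated, so the full matrix never needs to live
# on one device. Dimensions and device count below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_devices = 4                      # assumed number of model-parallel partitions
d_in, d_out = 512, 2048            # illustrative layer dimensions

x = rng.standard_normal((8, d_in))          # a batch of activations
W = rng.standard_normal((d_in, d_out))      # full weight matrix (for reference only)

# Reference: the un-partitioned computation on a single device.
y_full = x @ W

# Width-wise partition: each device stores d_out / n_devices output columns.
shards = np.split(W, n_devices, axis=1)
partial_outputs = [x @ shard for shard in shards]   # computed independently per device
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_full, y_parallel)
print("partitioned output matches the single-device result:", y_parallel.shape)
```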

For pre-training, the researchers collected different kinds of text from various sources, such as encyclopedias, news, novels, and Q&A.

Wrapping Up

While the researchers have achieved satisfactory results so far, they plan to further explore the power of large-scale pre-trained models by adding more diverse training data and increasing the model size. They also plan to optimise the training framework, such as the data-transfer scheme between nodes, to further accelerate training. “For text data, we will add a multilingual corpus to train a large-scale Chinese-centered multilingual language model,” they said.

While CPM is currently used only for technical and scientific purposes, some experts have pointed out that, unlike the work around GPT-3, the CPM research does not yet examine the model’s biases. That said, the research has come out barely six months after the GPT-3 paper was published, which is quite remarkable, and the team aims to achieve greater efficiency going forward.

Read the complete paper here. 


Srishti Deoras

Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.