GPT-3, one of the largest neural networks to date with 175 billion parameters, became last year’s breakthrough innovation. Released in June 2020, it can write code like a human, author blogs and stories, and build websites and apps. Its predecessor, GPT-2, had just 1.5 billion parameters.
Large-scale pretrained language models (PLMs) such as GPT-3 have demonstrated strong performance on natural language generation with few-shot in-context learning. However, among other disadvantages, most language models are only available in English. Large models like GPT-3 have been trained on a 45-terabyte dataset drawn almost entirely from English sources.
A possible solution to this problem may be in sight as Chinese company Huawei has developed PanGu Alpha, a 750-gigabyte model that contains up to 200 billion parameters. It is being touted as the Chinese equivalent of GPT-3 and is trained on 1.1 terabytes of Chinese language ebooks, encyclopedias, news, social media posts, and websites.
Training PanGu Alpha vs GPT-3
There are three main challenges in training large models with more than 10 billion parameters:
Model design: Not all PLMs can be smoothly scaled to hundreds of billions of parameters. As model size increases, training may converge slowly or diverge altogether. To overcome this, the PanGu Alpha researchers chose a Transformer-based autoregressive language model as the base architecture. An additional query layer was added on top of the Transformer layers, which helped the model scale up to 200 billion parameters.
Training data: The amount of data must be large enough to feed a large PLM, and the data must also be of high quality and diversity to ensure the model’s generality. The Huawei team collected data from sources such as ebooks, Common Crawl, news, and encyclopedias. The data was then processed, filtered, and cleaned to ensure quality and reliability. They eliminated documents that contained less than 60% Chinese characters or fewer than 150 characters, or that consisted only of titles, navigation bars, or advertisements. The text was then converted to simplified Chinese, and spam, low-quality samples, and text matching a list of more than 700 offensive words were filtered out.
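The two numeric rules described above (at least 60% Chinese characters, at least 150 characters) can be sketched in a few lines of Python. This is only an illustration, not the Huawei team's pipeline: the function names, the choice of the CJK Unified Ideographs range as the definition of a "Chinese character", and the optional blocklist argument are all assumptions made for the example.

```python
import re

# Assumption: a "Chinese character" is anything in the CJK Unified
# Ideographs block (U+4E00 to U+9FFF). The article does not say how
# the characters were actually counted.
CJK = re.compile(r"[\u4e00-\u9fff]")

MIN_CHINESE_RATIO = 0.60  # from the article: drop documents under 60% Chinese
MIN_LENGTH = 150          # from the article: drop documents under 150 characters


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK.match(c)) / len(chars)


def keep_document(text: str, blocklist: frozenset = frozenset()) -> bool:
    """Return True if the text passes the length, language, and blocklist checks."""
    if len(text) < MIN_LENGTH:
        return False
    if chinese_ratio(text) < MIN_CHINESE_RATIO:
        return False
    # Hypothetical blocklist check: reject text containing any listed word.
    return not any(word in text for word in blocklist)
```

A corpus would then be filtered with something like `[doc for doc in docs if keep_document(doc, blocklist)]`. Detecting title-only pages, navigation bars, and advertisements needs page-structure heuristics that the article does not describe, so they are left out here.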
Distributed training: The memory required by the 200-billion-parameter PanGu Alpha far exceeds what a single modern AI processor can hold, so the model has to be split across many devices. The problem becomes more challenging when the hardware topology is taken into account. For PanGu Alpha, the researchers combined five dimensions of parallelism and applied them to the model, which was trained on a cluster of 2048 Ascend AI processors powered by CANN.
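A rough calculation shows why a model of this size cannot sit on one device. The sketch below counts only the memory for the weights; the bytes-per-parameter figures (2 for 16-bit floats, 4 for 32-bit floats) are standard, but the 32 GB per-device memory is a hypothetical figure chosen for illustration, not a number from the article.

```python
import math

# Back-of-the-envelope estimate of memory for the model weights alone.
PARAMS = 200e9         # 200 billion parameters, from the article
DEVICE_MEMORY_GB = 32  # hypothetical per-device memory, for illustration only


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_devices(num_params: float, bytes_per_param: int, device_gb: float) -> int:
    """Smallest number of devices whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / device_gb)


if __name__ == "__main__":
    for label, nbytes in (("16-bit", 2), ("32-bit", 4)):
        gb = weight_memory_gb(PARAMS, nbytes)
        n = min_devices(PARAMS, nbytes, DEVICE_MEMORY_GB)
        print(f"{label} weights: {gb:.0f} GB, at least {n} devices")
```

Under these assumptions the weights alone come to 400 GB in 16-bit precision and 800 GB in 32-bit precision. Training needs considerably more than this, because gradients, optimizer state, and activations must also be stored, which is why the model is sharded across a large cluster.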
For this study, the authors trained three models of increasing size (2.6 billion, 13 billion, and 200 billion parameters) on a high-quality Chinese text corpus. The models were first evaluated on language modelling tasks, where perplexity decreased as model capacity, data, and computation increased. The team also investigated the model's text generation ability in scenarios such as dialogue generation, summarization, and question answering. The experimental results showed that, in general, performance improves with growing model capacity.
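For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token of a held-out text, so lower is better. The snippet below is a minimal sketch of that standard definition; the function name and the example probabilities are made up for illustration and are not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)), so a model that assigns higher
    probability to the observed tokens gets a lower score.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Made-up numbers: a weaker model gives each observed token
    # probability 0.1, a stronger one gives each token 0.25.
    weak = [math.log(0.10)] * 5
    strong = [math.log(0.25)] * 5
    print(perplexity(weak), perplexity(strong))
```

A model that assigns every observed token probability 0.25 has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four tokens.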
Experts believe that the most important feature of PanGu Alpha is its availability in the Chinese language. In terms of model architecture, however, the project does not appear to offer anything new. Further, the team itself has made no claims of overcoming major blockers, such as answering math problems correctly or responding to questions without paraphrasing training data. Nor does it offer a solution to the shortcomings, such as bias, of the GPT-3 model on which it is modelled. In addition, it carries over the same challenge as other large models: the carbon footprint.
The team is currently working on releasing the code via APIs for the benefit of both non-profit research institutes and commercial companies. Further, the team has open-sourced the parallel computing functionalities in the Auto-parallel module of MindSpore. MindSpore is a deep learning training and inference framework that can be used for mobile, edge, and cloud platforms.
Read the full paper here.