The Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) has introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters. It has surpassed OpenAI’s GPT-3 and Google’s Switch Transformer in size. HuggingFace DistilBERT and Google GShard are other popular language models. Wu Dao means ‘enlightenment’ in English.
“Wu Dao 2.0 aims to enable ‘machines’ to think like ‘humans’ and achieve cognitive abilities beyond the Turing test,” said Tang Jie, the lead researcher behind Wu Dao 2.0. The Turing test is a method to check whether or not a computer can think like humans.
Smartphone maker Xiaomi, short-video giant Kuaishou, on-demand service provider Meituan, more than 100 scientists and multiple organisations have collaborated with BAAI on this project.
Wu Dao 2.0
Wu Dao 2.0 is a pre-trained AI model that uses 1.75 trillion parameters to simulate conversational speech, write poems, understand pictures and even generate recipes. The next-generation Wu Dao model can also predict the 3D structures of proteins, similar to DeepMind’s AlphaFold, and power virtual idols. Recently, China’s first virtual student, Hua Zhibing, was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a Fast Mixture-of-Expert (MoE) training system similar to Google’s Mixture of Experts. Unlike Google’s MoE, FastMoE is an open-source system based on PyTorch (Facebook’s open-source framework) that works with common accelerators. It provides a hierarchical interface for flexible model design and easy adaptation to various applications like Transformer-XL and Megatron-LM. The source code of FastMoE is available here.
“[FastMoE] is simple to use, high-performance, flexible, and supports large-scale parallel training,” wrote BAAI in its official WeChat blog.
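For readers unfamiliar with the idea, the routing step at the heart of a mixture-of-experts layer can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not FastMoE's API or BAAI's implementation: the names `moe_forward`, `make_expert` and `gate_weights` are hypothetical, and the toy "experts" simply scale their input, whereas real experts are feed-forward networks.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every input value by `scale`."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def moe_forward(x, gate_weights, experts, k=1):
    """Route input `x` to the top-k experts chosen by a linear gate.

    Only the selected experts run, which is why an MoE model can hold
    many parameters while spending little compute per input.
    """
    # One gate score per expert: dot product of the gate row with the input.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Indices of the k experts with the highest gate probability.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        weight = probs[i] / norm  # renormalise over the selected experts
        y = experts[i](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top


# Four toy experts and an assumed (hand-picked) gate matrix.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_forward([0.5, 2.0], gate_weights, experts, k=1)
print("chosen experts:", chosen)
print("output:", output)
```

With `k=1`, each input activates a single expert, so adding more experts increases the parameter count without increasing the work done per input. Systems such as FastMoE additionally spread the experts across many accelerators, which this sketch does not attempt.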
Result-wise, Wu Dao 2.0 has surpassed SOTA levels on nine benchmark tasks, including:
- ImageNet (zero-shot) SOTA, exceeds OpenAI CLIP
- LAMA knowledge probing, surpasses AutoPrompt
- LAMBADA Cloze (ability-wise), surpasses Microsoft Turing NLG
- SuperGLUE (few-shot), surpasses OpenAI GPT-3
- UC Merced Land-Use (zero-shot) SOTA, exceeds OpenAI CLIP
- MS COCO (text-to-image generation), surpasses OpenAI DALL-E
- MS COCO (English image-text retrieval), surpasses Google ALIGN and OpenAI CLIP
- MS COCO (multilingual image-text retrieval), surpasses UC2 and M3P (the current best multilingual and multimodal models)
- Multi30K (multilingual image-text retrieval), surpasses UC2 and M3P
Showcasing benchmark tasks where Wu Dao 2.0 surpasses other SOTA models (Source: BAAI)
Towards multimodal models
Currently, AI systems are moving towards GPT-like multimodal and multitasking models to achieve artificial general intelligence (AGI). Experts believe there will be a rise in multimodal models in the coming months. Meanwhile, some are rooting for embodied AI, rejecting traditional bodiless models such as neural networks altogether.
Unlike GPT-3, Wu Dao 2.0 covers both Chinese and English, with skills acquired by studying 4.9 terabytes of text and images, including 1.2 terabytes each of Chinese and English text.
Google has also been working towards developing a multimodal model similar to Wu Dao. At Google I/O 2021, the search giant unveiled language models like LaMDA (reported to have 2.6 billion parameters) and MUM (Multitask Unified Model), which is trained across 75 different languages and said to be 1,000 times more powerful than BERT. At the time, Google CEO Sundar Pichai said that LaMDA, trained only on text, will soon shift to a multimodal model to integrate text, image, audio and video.
The training data of Wu Dao 2.0 include:
- 1.2 terabytes of English text data in the Pile dataset
- 1.2 terabytes of Chinese text in Wu Dao Corpora
- 2.5 terabytes of Chinese image-text data
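As a quick sanity check, the three components listed above add up to the 4.9 terabytes quoted for the full training set. The short sketch below only redoes that arithmetic with the figures given in this article; the dictionary labels are descriptive and not official dataset names.

```python
# Sizes in terabytes, taken from the list above.
corpus_tb = {
    "English text (the Pile)": 1.2,
    "Chinese text (Wu Dao Corpora)": 1.2,
    "Chinese graphic (image-text) data": 2.5,
}

total_tb = sum(corpus_tb.values())

# Share of the overall training data contributed by each component.
shares = {name: size / total_tb for name, size in corpus_tb.items()}

for name, size in corpus_tb.items():
    print(f"{name}: {size} TB ({shares[name]:.0%} of total)")
print(f"Total: {total_tb:.1f} TB")
```

The breakdown shows that roughly half of the data by volume is image-related, in line with the model's multimodal focus.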
Blake Yan, an AI researcher from Beijing, told South China Morning Post that these advanced models, trained on massive datasets, are good at transfer learning, just like humans. “Large-scale ‘pre-trained models’ are one of today’s best shortcuts to AGI,” said Yan.
“No one knows which is the right step,” said OpenAI in its GPT-3 demo blog post. “Even if larger ‘pre-trained models’ are the logical trend today, we may be missing the forest for the trees, and we may end up reaching a less determined ceiling ahead. The only clear aspect is that if the world has to suffer from ‘environmental damage,’ ‘harmful biases,’ or ‘high economic costs,’ not even reaching AGI would be worth it.”