Earlier this week, the Chinese internet giant Baidu released PLATO-XL, a pre-trained dialogue generation model with up to 11 billion parameters. It adopts a unified transformer architecture with high computation and parameter efficiency.
PLATO-XL carries out multi-party aware pre-training to better distinguish the characteristic information of different participants in social media conversations. As a result, it achieves superior performance compared to other approaches in both English and Chinese. This multi-party aware pre-training also effectively reduces the inconsistency problem in multi-turn conversations.
Soon, the company plans to release the source code on GitHub. “We will release our source code together with the English model at GitHub, hoping to facilitate frontier research in dialogue generation,” said Baidu researchers.
Language Models Vs Dialogue Generation Models
The effectiveness of the pre-training paradigm has been widely recognized in natural language processing (NLP) – the likes of Switch Transformer, GPT-3, BERT, XLNet, RoBERTa, LaMDA, etc. – where large-scale transformer models are trained on massive amounts of plain text. However, most of these language models follow the trend of enlarging the model size, the dataset size, or the amount of compute used for training.

More specifically, OpenAI’s GPT-3 model, with 175 billion parameters, shows strong zero-shot learning capabilities without task-specific fine-tuning on downstream tasks. That is where dialogue generation models come into the picture. Unlike general language models, dialogue generation models are usually pre-trained on human-like conversations collected from social media platforms – Reddit, Twitter, etc.
Some of the popular dialogue generation models include Microsoft’s DialoGPT, Google’s Meena, Facebook’s Blender and Baidu’s PLATO-2. These models have quickly been scaled up to billions of parameters and take advantage of ever larger volumes of social media conversations for pre-training. However, in dialogue generation there is still no clear conclusion about the correlation between model scale and conversation quality, and the models face other limitations such as unfair biases, misleading information and an inability to learn continuously.
Hopefully, the recently launched PLATO-XL will continue to improve conversation quality in terms of fairness and factuality.

For example, DialoGPT comes in three model sizes – 117 million, 345 million and 762 million parameters. Of these, the 345-million-parameter version is said to have obtained the best performance in the authors’ evaluations. Similarly, Blender’s 2.7-billion-parameter model achieved better performance than the one with 9.4 billion parameters.
Baidu researchers believe that conversation quality might benefit from an enlarged model scale combined with an appropriate pre-training design, and PLATO-XL fits perfectly into that equation. Besides open-domain conversation, the model is also explored on two common conversational tasks – knowledge-grounded dialogue and task-oriented conversation.
Plus, the researchers have explored PLATO-XL’s ability to serve as a foundation model for conversational AI. Interestingly, their experiments indicated that PLATO-XL could outperform other dialogue generation models across multiple conversational tasks.

Inside PLATO-XL
PLATO-XL adopts a unified transformer architecture that allows simultaneous modelling of dialogue understanding and response generation, which makes it more parameter-efficient. A flexible self-attention mask mechanism enables bidirectional encoding of the dialogue history and unidirectional decoding of responses. The unified transformer architecture also proves efficient for training dialogue generation.
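To make the mechanism concrete, the minimal sketch below (not Baidu’s implementation; the function name and layout are assumptions for illustration) shows how such a unified attention mask can be constructed: context tokens attend to each other bidirectionally, while response tokens attend to the full context and only to earlier response tokens.

```python
import numpy as np

def unified_attention_mask(context_len: int, response_len: int) -> np.ndarray:
    """Illustrative self-attention mask for a unified (UniLM-style) transformer.

    Context tokens attend to each other bidirectionally; response tokens
    attend to the full context plus earlier response tokens only (causal).
    1 = attention allowed, 0 = blocked.
    """
    total = context_len + response_len
    mask = np.zeros((total, total), dtype=np.int32)

    # Dialogue history: full bidirectional attention within the context,
    # but context tokens never look ahead at the (future) response.
    mask[:context_len, :context_len] = 1

    # Response tokens: attend to the whole context ...
    mask[context_len:, :context_len] = 1
    # ... and to themselves causally (lower-triangular part).
    mask[context_len:, context_len:] = np.tril(
        np.ones((response_len, response_len), dtype=np.int32)
    )
    return mask

if __name__ == "__main__":
    # A 3-token context with a 2-token response.
    print(unified_attention_mask(context_len=3, response_len=2))
```

One mask therefore serves both the understanding side (bidirectional) and the generation side (unidirectional), which is what allows a single set of parameters to cover both roles.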
Because conversation samples vary in length, padding introduces many wasted computations during training. The unified transformer can greatly improve training efficiency by effectively sorting the input samples, so that sequences of similar length are batched together and little padding is needed.
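The following is a minimal illustration of the general idea of length-sorted batching, not Baidu’s actual data pipeline; the function and the toy dialogues are assumptions made for the example.

```python
from typing import Iterable, List, Sequence

def length_sorted_batches(
    samples: Sequence[Sequence[int]], batch_size: int
) -> Iterable[List[Sequence[int]]]:
    """Group token sequences of similar length to cut padding waste.

    Sorting by length before batching means each batch only has to be
    padded up to its own longest member, not the longest sample overall.
    """
    ordered = sorted(samples, key=len)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]

# Toy example: short and long dialogues end up in separate batches.
dialogues = [[1, 2], [3], [4, 5, 6, 7, 8, 9], [10, 11, 12], [13, 14, 15, 16], [17]]
for batch in length_sorted_batches(dialogues, batch_size=3):
    print([len(seq) for seq in batch])
```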

PLATO-XL, with 11 billion parameters, comprises two dialogue models – one Chinese and one English – and over 100 billion tokens of data are used in pre-training. The dialogue model is implemented on PaddlePaddle, a deep learning platform developed by Baidu. To train such a large model, PLATO-XL adopts gradient checkpointing and sharded data parallelism provided by FleetX, PaddlePaddle’s distributed training library. It is trained on a high-performance GPU cluster with 256 NVIDIA Tesla V100 32G GPU cards.
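Since the training code is not yet public, the snippet below is only a conceptual sketch of gradient checkpointing, written with PyTorch’s torch.utils.checkpoint utility for illustration rather than the PaddlePaddle/FleetX stack PLATO-XL actually uses; the module and sizes are invented for the example.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Toy feed-forward block wrapped with gradient checkpointing.

    Activations inside the block are not stored during the forward pass;
    they are recomputed during backward, trading extra compute for the
    activation memory needed to fit a very large model on each GPU.
    """

    def __init__(self, hidden: int) -> None:
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
        return checkpoint(self.ff, x, use_reentrant=False)

block = CheckpointedBlock(hidden=64)
x = torch.randn(8, 64, requires_grad=True)
block(x).sum().backward()  # activations are recomputed here instead of stored
```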
Wrapping up
With this latest development, Baidu’s PLATO-2 has been upgraded to PLATO-XL, which, with over ten billion parameters, is the world’s largest Chinese and English dialogue generation model. The researchers believe that it achieves superior performance in open-domain conversation and raises expectations of what future hundred-billion or even trillion-parameter dialogue models could do, beyond existing systems such as Blender, DialoGPT, EVA and PLATO-2. Moreover, PLATO-XL demonstrates significantly better performance than current mainstream commercial chatbots.

Furthermore, Baidu’s PLATO-XL opens new horizons in open-domain conversation, considered one of the most challenging tasks in NLP. Touted as the largest pre-trained model for English and Chinese dialogue, PLATO-XL reaches a new level of conversational consistency and factuality – one step closer to human-like learning and conversational abilities.