Along with announcing the chatbot, the researchers shared the recipe for building and deploying it. They stated that, for the first time, a chatbot can blend a diverse set of conversational skills, including empathy, knowledge and personality, in a single system.
Building an open-domain chatbot is one of the most complex and challenging problems in machine learning. To build a high-performance chatbot, the researchers scaled neural models in both the number of parameters and the size of the data they are trained on. The researchers stated, “Good conversation requires a number of skills that an expert conversationalist blends in a seamless way, providing engaging talking points, listening to their partners, as well as displaying knowledge, empathy and personality appropriately while maintaining a consistent persona.”
According to the researchers at Facebook AI, the recipe for the new chatbot incorporates not only large-scale neural models, with up to 9.4 billion parameters (3.6x more than the largest existing system), but also equally important techniques for blending skills and detailed generation.
The main steps of building this chatbot are scale, blending skills and generation strategies.
To create a high-performance chatbot, the first step is large-scale training. For this, the researchers pre-trained large Transformer neural networks with up to 9.4 billion parameters on large amounts of conversational data. They used previously available public-domain conversations, which provided 1.5 billion training examples of extracted conversation.
For blending skills, the researchers selected specific tasks that make the model focus on personality and engagingness, knowledge, and empathy. They used a recently introduced task set-up called Blended Skill Talk (BST) for training and evaluating these desirable skills. BST targets these aspects by providing training data and initial conversational context. Fine-tuning on BST not only emphasised the desirable traits but also showed that this tuning can minimise undesirable traits, such as toxicity, learnt from large corpora.
According to the researchers, BST consists of the following skills:
- Engaging use of personality
- Engaging use of knowledge
- Display of empathy
- Ability to blend all three seamlessly
To avoid repetition in an agent's responses during a conversation, researchers usually apply a number of generation strategies, such as beam search, next-token sampling and n-gram blocking.
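Of these strategies, n-gram blocking is the easiest to illustrate. The following is a minimal sketch, not the researchers' implementation: it assumes a greedy decoder and a hypothetical dictionary of next-token scores, and the function names `banned_next_tokens` and `pick_next` are made up for this example. The idea is to forbid any next token that would complete an n-gram already present in the generated text.

```python
from typing import Dict, Sequence, Set


def banned_next_tokens(tokens: Sequence[str], n: int) -> Set[str]:
    """Return the tokens that would complete an n-gram already seen in `tokens`."""
    if n < 2 or len(tokens) < n:
        return set()
    # The last n-1 generated tokens are the prefix of the n-gram being formed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, block its last token.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


def pick_next(scores: Dict[str, float], tokens: Sequence[str], n: int = 3) -> str:
    """Greedy choice of the next token with n-gram blocking applied.

    `scores` stands in for a model's next-token scores (an assumption for
    this sketch; a real decoder would work on logits over a vocabulary).
    """
    banned = banned_next_tokens(tokens, n)
    allowed = {tok: s for tok, s in scores.items() if tok not in banned}
    # Fall back to the unrestricted scores if blocking removes every candidate.
    pool = allowed or scores
    return max(pool, key=pool.get)


history = ["i", "like", "cats", "and", "i", "like"]
toy_scores = {"cats": 0.9, "dogs": 0.1}
next_token = pick_next(toy_scores, history, n=3)  # "cats" would repeat "i like cats"
```

In this toy case the higher-scoring token is blocked because it would repeat a trigram, so the decoder falls through to the next-best candidate. The same check can be applied to each hypothesis inside beam search.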
In this work, the researchers consider three types of architecture: retrieval, generative, and retrieve-and-refine (RetNRef) models. The retrieval systems use the poly-encoder architecture, and the generator uses Byte-Level BPE tokenisation trained on the pre-training data.
Given a dialogue history as input, a retrieval system selects the next dialogue utterance by scoring a large set of candidate responses and outputting the highest-scoring one. The generative model instead employs a standard Seq2Seq Transformer architecture to generate responses rather than retrieve them from a fixed set. Lastly, for retrieve-and-refine, the researchers considered two variants of the retrieval step: dialogue retrieval and knowledge retrieval.
For pre-training, the researchers used the pushshift.io Reddit dataset, which is a variant of Reddit Discussions. According to the researchers, this dataset is a good candidate for helping train a dialogue model in the open-domain case.
For fine-tuning, the researchers used three datasets: ConvAI2, Empathetic Dialogues (ED) and Wizard of Wikipedia (WoW). ConvAI2 includes training data of 140k utterances from paired crowd workers having a conversation in which they get to know each other. The Empathetic Dialogues dataset consists of 50k utterances of crowd worker conversations grounded in an emotional situation. The Wizard of Wikipedia task involves discussing a given topic in depth, where the goal is both to engage the partner and to display expert knowledge.
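The retrieval step described above can be sketched in a few lines. This is an illustration only: it swaps the poly-encoder for a toy bag-of-words representation and cosine similarity, and the names `embed`, `cosine` and `retrieve` are hypothetical. What it shares with the real system is the shape of the procedure: score every candidate against the dialogue history and return the highest-scoring one.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words representation (an assumption; real systems use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(history: List[str], candidates: List[str]) -> Tuple[str, float]:
    """Score each candidate response against the history and return the best one."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    context = embed(" ".join(history))
    scored = [(cosine(context, embed(c)), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


reply, score = retrieve(
    ["do you have any pets"],
    ["i have two pets at home", "the weather is nice today"],
)
```

A generative model has no such fixed candidate set, which is why the retrieve-and-refine variants combine the two: a retrieved utterance or knowledge passage is handed to the generator as extra context.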
To evaluate the chatbot, the researchers benchmarked its performance against Google’s Meena chatbot through pairwise human evaluations. They used the ACUTE-Eval method, in which human evaluators are shown dialogues between humans and each respective chatbot and asked to compare them.
The researchers released pre-trained and fine-tuned generative models with 90M, 2.7B and 9.4B parameters, and provided a script for interacting with the bot with safety filtering built in. According to the researchers, this method goes a step further than previous work and improves performance in terms of engagingness and humanness. However, the model still has various issues, such as non-trivial repetition, limits on knowledge and factual correctness, contradiction and forgetfulness, which need to be mitigated in future studies.
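A pairwise human evaluation of this kind boils down to counting which system each evaluator preferred. The sketch below shows that arithmetic only; the votes are invented placeholders and are not results from the paper, and the function name `win_rate` and the labels "A" and "B" are hypothetical.

```python
from typing import Iterable


def win_rate(votes: Iterable[str], model: str) -> float:
    """Fraction of pairwise comparisons in which `model` was preferred."""
    votes = list(votes)
    if not votes:
        raise ValueError("no votes to aggregate")
    return sum(1 for v in votes if v == model) / len(votes)


# Placeholder votes: each entry names the chatbot an evaluator preferred
# after reading one dialogue from each system side by side.
toy_votes = ["A", "B", "A", "A"]
rate_a = win_rate(toy_votes, "A")
rate_b = win_rate(toy_votes, "B")
```

In practice a significance test would accompany the raw win rate, since a small number of comparisons can favour either system by chance.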
Read the paper here.