In May last year, Google announced a language model called LaMDA, or ‘Language Model for Dialogue Applications’, at Google I/O 2021, and it has now published advances on the same project. LaMDA is built by fine-tuning a family of Transformer-based neural language models specialised for dialog, with up to 137B model parameters.
Google says that it has been building LaMDA’s conversational skills for a long time. It added that the architecture produces a model that can be trained to read many words, attend to how those words relate to one another, and predict what word will come next.
As per the paper titled “LaMDA: Language Models for Dialog Applications”, the benefits of model scaling with LaMDA are studied across three metrics: quality, safety, and groundedness.
The research team observed that model scaling alone improves quality, but its improvements in safety and groundedness lag far behind human performance. It also found that combining scaling and fine-tuning improves LaMDA significantly on all three metrics. “Even if the model’s performance remains below human levels in safety and groundedness, the quality gap to measured crowd worker levels can be narrowed,” the team added.
The paper breaks quality into three components: sensibleness, specificity, and interestingness. The team collected annotated data describing how sensible, specific and interesting a response is for a multiturn context, then used these annotations to fine-tune a discriminator to re-rank candidate responses.
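The three annotated components can be combined into a single quality score for re-ranking. As a minimal sketch, a simple average of the three labels is used below; the equal weighting is an illustrative assumption, not the paper's exact formula.

```python
def ssi_score(labels):
    """Combine the three quality annotations (each in [0, 1]) into one
    SSI score. Equal-weight averaging is an illustrative choice only."""
    return (labels["sensible"] + labels["specific"] + labels["interesting"]) / 3


# Example: a sensible and specific but not interesting response.
score = ssi_score({"sensible": 1, "specific": 1, "interesting": 0})
```

A learned discriminator, as used in the paper, would predict such scores directly from the response text rather than from hand-set labels.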
The safety metric aims to reduce the number of unsafe responses. The team defined an illustrative set of safety objectives capturing the behaviours the model can exhibit in a dialog, and used a demographically diverse set of crowd workers to label responses in multiturn dialogs against those objectives. These labels are used to fine-tune a discriminator to detect and remove unsafe responses.
Groundedness is introduced to produce responses that are grounded in known sources whenever they contain verifiable external-world information. The paper adds that though grounding in known sources does not guarantee factual accuracy, it allows users to judge the validity of a response based on the reliability of its source and its reproduction.
LaMDA undergoes two-stage training: pre-training and fine-tuning. For the pre-training stage, the team created a dataset of 1.56T words from public dialog data and other public web documents. The dataset was then tokenised into 2.81T SentencePiece tokens, and the model was pre-trained using GSPMD to predict every next token in a sentence, given the previous tokens.
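The next-token objective means every position in the token stream becomes a training example: the tokens before it are the context, the token itself is the target. A minimal sketch over a list of token IDs (the window length is an arbitrary assumption for illustration):

```python
def next_token_examples(token_ids, context_len=4):
    """Turn a token stream into (context, next_token) training pairs.

    Each example pairs up to `context_len` preceding tokens with the
    token the model must learn to predict at that position.
    """
    examples = []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - context_len):i]
        examples.append((context, token_ids[i]))
    return examples


# A toy stream of 4 token IDs yields 3 prediction targets.
pairs = next_token_examples([5, 6, 7, 8])
```

In practice the context window is thousands of tokens and the pairs are consumed by a Transformer trained with cross-entropy loss, but the example structure is the same.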
In the fine-tuning stage, the team trains LaMDA to perform a mix of generative tasks producing natural-language responses to given contexts. The paper adds, “The LaMDA generator is trained to predict the next token on a dialog dataset restricted to back-and-forth dialog between two authors, while the LaMDA classifiers are trained to predict the Safety and Quality (SSI) ratings for the response in context using annotated data.”
Given the current multi-turn dialog context, the LaMDA generator produces many candidate responses, and the LaMDA classifiers predict Safety and SSI scores for each. Responses with low Safety scores are filtered out first; the remaining candidates are re-ranked by their SSI scores, and the top result is selected as the chosen response.
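The filter-then-re-rank step described above can be sketched in a few lines. The threshold value and the score field names here are illustrative assumptions, not values from the paper:

```python
def select_response(candidates, safety_threshold=0.5):
    """Pick a response: drop candidates whose Safety score is below the
    threshold, then return the survivor with the highest SSI score.

    `candidates` is a list of dicts with "text", "safety" and "ssi" keys;
    the threshold of 0.5 is an arbitrary illustrative choice.
    """
    safe = [c for c in candidates if c["safety"] >= safety_threshold]
    if not safe:
        return None  # no candidate passed the safety filter
    return max(safe, key=lambda c: c["ssi"])


# The unsafe-but-high-SSI candidate is filtered before ranking.
best = select_response([
    {"text": "A", "safety": 0.9, "ssi": 0.6},
    {"text": "B", "safety": 0.2, "ssi": 0.9},
    {"text": "C", "safety": 0.8, "ssi": 0.7},
])
```

Filtering before ranking matters: an unsafe response can never win on quality alone, which matches the order of operations the paper describes.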
The team collected responses from the pre-trained model, the fine-tuned model, human raters, and multi-turn two-author dialogs. A different set of human raters was then asked a series of questions to evaluate these responses against the three metrics of quality, safety, and groundedness.
The results show that LaMDA significantly outperforms the pre-trained model across all dimensions and model sizes.
The paper says that the quality metrics generally improve with the number of model parameters, with or without fine-tuning.
Safety does not benefit from model scaling alone but shows improvement with fine-tuning.
As the model size increases, groundedness improves. Through fine-tuning, the model can access external knowledge sources and effectively shift some of the load of remembering knowledge onto those sources.
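Shifting factual load to an external source can be sketched as follows: the model's drafted answer is replaced by a value looked up in an external store whenever one is available. The knowledge store and its keys here are entirely hypothetical; a real system would query a search or retrieval tool rather than a dict.

```python
# Hypothetical external knowledge store; a stand-in for a real
# retrieval tool. The entry below is a well-known public fact.
KNOWLEDGE = {"height of Mount Everest": "8,849 m"}


def grounded_response(question, draft):
    """Prefer a fact from the external store over the model's own draft,
    so factual recall need not live in the model's parameters."""
    fact = KNOWLEDGE.get(question)
    if fact is not None:
        return f"{fact} (source: external knowledge store)"
    return draft  # fall back to the model's ungrounded draft


answer = grounded_response("height of Mount Everest", "around 8,000 m")
```

This also illustrates the paper's caveat: grounding lets a user trace the answer to a source, but the answer is only as accurate as that source.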