There has been a recent growing interest in using language models to generate text for practical applications. Conglomerates are deploying their own models, and hundreds of organizations are deploying GPT-3 using APIs from OpenAI and others. Using these AI-based language models for business applications is also gaining steam. But while these language models are splendidly fluent, they also seem to possess a tendency to generate false statements that are potentially harmful disinformation. To minimize these risks, a new dataset named TruthfulQA has emerged.
What is TruthfulQA?
Researchers at the University of Oxford and OpenAI have recently created a dataset called TruthfulQA that contains questions some humans might answer incorrectly due to false beliefs or misconceptions. The researchers, while testing, found that, although the best-performing model was truthful on 58% of questions, it fell way short of human performance at 94%. The dataset aims at answering: How likely are models to make false statements across a range of contexts and questions? Recent studies have shown that the metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative. TruthfulQA helps avoid these benchmarking pitfalls with a bank of questions from different sectors of life such as health, law, finance, and politics. These questions require models to avoid generating false answers that are learned from the text.
The team tested a number of different models on TruthfulQA, including GPT-3, its predecessor GPT-2, and open-source versions of GPT-3, GPT-Neo and GPT-J. It was also tested on UnifiedQA; a model fine-tuned on question answering tasks. To classify the answers from models as either true or false, the team also developed “GPT-judge,” an algorithm trained on answers to TruthfulQA questions from all of the evaluated models.
Image Source: TruthfulQA
How does it work?
TruthfulQA consists of two tasks that use the same sets of questions and reference answers.
The Task: To a given question, generate a 1-2 sentences answer.
Objective: The primary objective of this task is to measure the overall truthfulness of the answers, expressed as the percentage of the model’s answers that are true. The secondary objective is to understand what percentage of the model’s answers are actually informative.
Metrics: BLEURT, ROUGE, and BLEU are used to compare the model’s answer to each of the true and false reference answers. A score is then given by calculating maximum similarity to a true reference answer subtracted by the maximum similarity to a false reference answer.
Key Findings from TruthfulQA
The results showed that the larger the model, the worse its performance in terms of being truthful. Larger models only did better on questions that exactly match the syntax of TruthfulQA but do not probe misconceptions. On the other hand, the smallest model produces a true but uninformative answer. The intermediate answers were more informative but partly false or exaggerated. The largest model says something literally false, mimicking a human superstition.
When forced to choose from multiple responses rather than generate answers themselves, larger models also performed worse on TruthfulQA than smaller ones. No models significantly outperformed random guessing; even the best performing model gave false answers 42% of the time versus 6% for human participants. It showed that 87% of the humans’ answers were true on TruthfulQA. The researchers discovered that either the models did not learn the training distribution well enough or that the models’ training objectives encouraged false answers. It showed that scaling up models alone is less promising for improving truthfulness than fine-tuning them using training objectives.
Image Source: TruthfulQA
Limitations of TruthfulQA
TruthfulQA tests models on general and knowledge-based questions that are designed to eliminate imitative falsehoods. Even if a model performs well, it cannot completely conclude that it will be equally truthful on other kinds of tasks. For example, TruthfulQA does not cover long-form generation such as news articles or interactive settings such as extended chat with an adversarial human. Moreover, while the questions in TruthfulQA closely resemble real-world questions, they were not collected from a deployed system and hence may over- or underestimate truthfulness for the deployed system being tested.
Summing up
Although such work adds scepticism towards large language models and their training datasets, it also addresses the weaknesses of conventional language models currently being used and developed. While some of the best benchmarks performance scores today are achieved from large datasets and models, the results and effects of continuously adding enormous amounts of data into models still remain uncertain and yet to be completely discovered.