After GPT-3 showcased SOTA performance on various NLP tasks, researchers from UC Berkeley, Columbia University, UChicago, and UIUC have proposed a new test to measure its multitasking capabilities.
Because transformer-based models like GPT-3 are trained on massive text corpora drawn from numerous websites, they have displayed excellent results on NLP benchmarks and specialised topics. However, it remained unclear how well these language models actually grasp knowledge across different domains. To check GPT-3's multitask accuracy, the researchers therefore proposed a test covering 57 tasks, including US history, elementary maths, computer science and law, to name a few.
In a recent paper, the researchers state that the benchmark is designed to measure the knowledge a model acquires during training, with tasks ranging from elementary level to advanced problem solving.
How Does The Multitask Test Work?
To facilitate this accuracy test, the researchers created a massive multitask test comprising approximately 15,000 multiple-choice questions gathered manually from available online sources. The questions were then split into three sets: a few-shot development set, a validation set, and a test set containing the bulk of the questions.
Because the test aggregates subjects of varying difficulty, it probes capabilities beyond straightforward commonsense reasoning and narrow linguistic knowledge. To this end, the researchers evaluated the models on real-world text understanding, measuring how much useful knowledge they can extract from the massive training data fed to them.
To assess GPT-3, the researchers compared UnifiedQA and the entire GPT-3 family: a small model with 2.7 billion parameters, a medium one with 6.7 billion, a large one with 13 billion, and an X-large model with 175 billion parameters, computing classification accuracy across all tasks.
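The three-way split described above can be sketched as follows. This is an illustrative helper, not the paper's released code; the function name and the 10% validation fraction are assumptions (the paper fixes the development set at five questions per subject, which the sketch does respect).

```python
import random

def split_questions(questions, n_dev=5, val_frac=0.1, seed=0):
    """Split one subject's multiple-choice questions into a 5-question
    few-shot development set, a validation set, and a test set holding
    the remaining (largest) share of the questions."""
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    dev = shuffled[:n_dev]
    n_val = int(len(shuffled) * val_frac)
    val = shuffled[n_dev:n_dev + n_val]
    test = shuffled[n_dev + n_val:]
    return dev, val, test

dev, val, test = split_questions([f"q{i}" for i in range(100)])
print(len(dev), len(val), len(test))  # 5 10 85
```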
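Aggregating classification accuracy across many tasks can be done as an unweighted mean, so that every subject counts equally regardless of how many questions it has. The helper below is a plausible sketch of that aggregation, not the paper's exact evaluation script; the function name and the sample results are invented for illustration.

```python
def multitask_accuracy(per_task_correct):
    """Unweighted mean of per-task classification accuracy, so each
    subject contributes equally regardless of its question count."""
    task_accs = [sum(c) / len(c) for c in per_task_correct.values()]
    return sum(task_accs) / len(task_accs)

# Hypothetical per-question correctness (1 = correct, 0 = wrong):
results = {
    "us_history":   [1, 1, 0, 1],  # 75% on 4 questions
    "college_math": [0, 1],        # 50% on 2 questions
}
print(multitask_accuracy(results))  # 0.625
```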
Figure: models tested on four broad disciplines, with accuracy shown as percentages.
For the few-shot setting, the researchers fed GPT-3 a prompt containing five demonstration examples with their answers before asking the question. For zero-shot evaluation, they simply appended the question to the prompt without demonstrations.
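A minimal sketch of this prompt construction is shown below. The header wording and helper names are assumptions for illustration; only the overall shape (an optional block of five answered demonstrations followed by the unanswered target question) reflects the setup described above.

```python
LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one multiple-choice question; include the answer letter
    for demonstration examples, leave it blank for the target."""
    lines = [question] + [f"{L}. {c}" for L, c in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, target_q, target_choices):
    """Few-shot prompt: answered demonstrations, then the target
    question. Passing shots=[] yields the zero-shot variant."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    demos = "\n\n".join(format_question(q, c, a) for q, c, a in shots)
    sep = "\n\n" if demos else ""
    return header + demos + sep + format_question(target_q, target_choices)

shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt("elementary mathematics", shots,
                      "What is 3 + 3?", ["5", "6", "7", "8"])
```

The prompt ends with a bare `Answer:`, so the model's next token is its answer choice.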
Left: examples of few-shot prompts and inferences using GPT-3; the answer underlined in blue is GPT-3's response. Right: performance on commonsense benchmarks, linguistic benchmarks, and the proposed multitask test.
Notice that GPT-3 produces probabilities for the answer tokens A, B, C and D, and the token with the highest probability is treated as the prediction. To obtain a consistent evaluation across models, the researchers also created a dev set with five fixed few-shot examples for each subject.
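The argmax step over the four answer tokens reduces to a one-liner. The sketch below assumes the model's per-token log-probabilities are already available as a dict (the name `token_logprobs` and the sample values are invented):

```python
def predict(token_logprobs):
    """Pick the answer letter to which the model assigns the highest
    probability among the four continuation tokens A, B, C, D."""
    return max("ABCD", key=lambda letter: token_logprobs[letter])

# Hypothetical log-probabilities for one question:
print(predict({"A": -2.1, "B": -0.4, "C": -3.0, "D": -1.7}))  # B
```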
To relate model size to accuracy, each GPT-3 model, whether small, medium, large or X-large, was compared on its few-shot accuracy. Here the researchers noted that the three smaller GPT-3 models and Google's T5 model have near-random accuracy of approximately 25%. The X-large model with 175 billion parameters, however, performed substantially better, reaching 43.9% accuracy in the few-shot setting and 37.7% in the zero-shot setting.
These results suggest that the bigger the model, the higher its accuracy on the multitask test. To probe this further, the researchers also compared the UnifiedQA models, which reached 38.5% accuracy without any fine-tuning on the benchmark, making them less accurate than few-shot GPT-3 X-large but more accurate than zero-shot GPT-3 X-large.
Furthermore, even the smallest UnifiedQA model, with just 60 million parameters, achieved roughly 30% accuracy, better than random and better than far larger GPT-3 models. It thus appears that model size plays a significant part, but is not the whole story, in achieving accurate performance.
That said, size is not the only criterion for top performance in a language model. Comparing accuracy across disciplines, the researchers discovered that GPT-3 has uneven accuracy and several substantial knowledge gaps.
Evaluating it on all 57 tasks, the researchers noted that GPT-3 does not acquire knowledge in the order humans do; it follows its own unusual pattern. Its performance was below expert level across tasks, ranging from 26% for college chemistry to 69% for US foreign policy, and overall it performed poorly on heavily procedural problems.
Breaking the results down by task, GPT-3 was less accurate on calculation-heavy STEM subjects such as maths than on subjects requiring verbal knowledge. Much of this can be attributed to the model acquiring declarative knowledge more readily than procedural knowledge. While it achieved 47.4% accuracy on college medicine and 35% on college mathematics, its accuracy dropped to 29.9% on elementary mathematics, which involves more calculation.
Additionally, the researchers evaluated GPT-3's calibration, which is critical for trusting its predictions, by testing how well its average confidence estimates its actual accuracy on each task. They found that GPT-3 is not well calibrated, producing miscalibrated forecasts with an error of up to 19.4%. They therefore believe there is an immense opportunity to improve model calibration.
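A simplified stand-in for this calibration check compares a model's average stated confidence with its realised accuracy; a perfectly calibrated model has a gap of zero. The function name and the sample numbers below are illustrative, and this is a coarser measure than the paper's per-task calibration metric.

```python
def confidence_accuracy_gap(confidences, correct):
    """Absolute gap between the model's average confidence in its
    predictions and the fraction it actually got right."""
    avg_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(avg_conf - accuracy)

# A hypothetical model that is 90% confident on average
# but answers only 3 of 5 questions correctly:
gap = confidence_accuracy_gap([0.9, 0.95, 0.85, 0.9, 0.9],
                              [1, 1, 0, 1, 0])
print(round(gap, 2))  # 0.3
```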
With the 57 designed tasks, the researchers aimed to measure GPT-3's multitasking capabilities; after experimenting and evaluating, they noted that the model struggles to make meaningful progress on the test, with lopsided performance that excels at no subject. GPT-3 showed low accuracy not only on calculation-based tasks but also on socially important subjects, including morality and law.
Thus it can be said that while GPT-3 has extensive breadth of knowledge, it does not master any single subject and carries many knowledge blind spots alongside its uneven accuracy. With this proposed benchmark, the researchers aim to help others pinpoint the limitations of such models and obtain an accurate view of state-of-the-art capabilities.
Read the whole paper here.