With one toe in the dream of AGI, GPT-3 has become a benchmark for natural language processing models. OpenAI’s DALL·E, AI21’s Jurassic-1, Google’s LaMDA, Microsoft’s Turing NLG and BAAI’s Wu Dao 2.0 are GPT-3’s successors, claimed to be equivalent to or even better than the popular model. In addition, AI2 has just released an upgraded version of its open-source question-answering model, Macaw, which it says outperforms GPT-3.
All about Macaw
Macaw is a QA model based on a multi-angle approach that leverages different combinations of inputs and outputs to produce its results. It has successfully tackled various question types, including general knowledge, meta reasoning, hypothetical, and story understanding. Moving away from the traditional reliance on a structured database to find answers, Macaw is built on pre-trained models for better versatility. Training the model on different permutations of inputs and outputs improves its performance across tasks while making it more flexible.
Macaw uses a combination of ‘slots’ as its inputs and outputs. The slots are Context, Question, Multiple-choice options, Answer and Explanation. It then uses different ‘angles’, or combinations of these slots, to generate different outputs. Macaw’s paper describes it as a model that “produces high-quality answers to questions far outside the domain it was trained on” and “generates explanations as an optional output (or even input) element”. These explanations tend to be of lower quality than the answers, but Macaw is one of the few models with the capacity to produce them.
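The slot-and-angle idea can be sketched in a few lines of Python. This is only an illustration: the slot names follow the article, but the shorthand `mcoptions`, the helper `build_angle` and the `$slot$` string layout are assumptions made for the example and should not be read as Macaw’s confirmed interface. No model is called; the sketch only shows how one angle pairs the slots that are supplied with the slots that are requested.

```python
# Illustrative sketch only: slot names come from the article, while the
# "$slot$" layout and this helper are assumptions, not a confirmed Macaw API.
SLOTS = ("context", "question", "mcoptions", "answer", "explanation")


def build_angle(inputs, outputs):
    """Format one 'angle': requested output slots first, then filled input slots."""
    unknown = [s for s in list(inputs) + list(outputs) if s not in SLOTS]
    if unknown:
        raise ValueError(f"unknown slot(s): {unknown}")
    overlap = set(inputs) & set(outputs)
    if overlap:
        raise ValueError(f"slot(s) used as both input and output: {sorted(overlap)}")
    requested = [f"${name}$" for name in outputs]
    provided = [f"${name}$ = {value}" for name, value in inputs.items()]
    return " ; ".join(requested + provided)


# Angle 1: question in, answer plus explanation out.
qa_angle = build_angle(
    {"question": "If a bird didn't have wings, how would it be affected?"},
    ["answer", "explanation"],
)

# Angle 2: the same slots rearranged, with the answer as input and a question as output.
reverse_angle = build_angle({"answer": "It would be unable to fly"}, ["question"])

print(qa_angle)
print(reverse_angle)
```

The point of the sketch is that a single model can be asked for different things simply by changing which slots are filled and which are left open.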
Macaw vs GPT-3
Macaw has 11 billion parameters. By comparison, GPT-3 has 175 billion, Jurassic-1 has 178 billion, PanGu-Alpha has 200 billion and Wu Dao 2.0 has 1.75 trillion. Yet AI2 claims that, despite its far smaller parameter count, Macaw outperformed GPT-3 by over 10% on a suite of 300 challenge questions.
AI2 created a set of 300 challenge questions to test the system and compare its performance to existing models. Macaw scored 75%, while GPT-3 and Jurassic-1 scored 65%, and Google’s T5-CBQA scored 57%. A few examples of how GPT-3 and Macaw responded to questions are illustrated below.
How would you make a marriage stronger?
GPT-3: I would make sure that I was the one who was always right.
If a bird didn’t have wings, how would it be affected?
Macaw: It would be unable to fly
GPT-3: It would be a bad bird.
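To put the reported figures in perspective, the short sketch below works out the gap between Macaw and GPT-3 in two ways (percentage points and relative improvement) alongside the difference in model size. The scores and parameter counts are the ones quoted in this article; the function names are hypothetical helpers written for this example.

```python
# Scores (% correct on the 300-question suite) and sizes as quoted in the article.
scores = {"Macaw": 75, "GPT-3": 65, "Jurassic-1": 65, "T5-CBQA": 57}
params_billion = {"Macaw": 11, "GPT-3": 175, "Jurassic-1": 178}


def point_gap(model, baseline):
    """Difference in accuracy, in percentage points."""
    return scores[model] - scores[baseline]


def relative_gain(model, baseline):
    """Improvement over the baseline, as a percentage of the baseline score."""
    return (scores[model] - scores[baseline]) / scores[baseline] * 100


def size_ratio(larger, smaller):
    """How many times more parameters the larger model has."""
    return params_billion[larger] / params_billion[smaller]


print(point_gap("Macaw", "GPT-3"))                # 10 percentage points
print(round(relative_gain("Macaw", "GPT-3"), 1))  # 15.4 (% relative gain)
print(round(size_ratio("GPT-3", "Macaw"), 1))     # 15.9 (times larger)
```

On these numbers, Macaw is ahead by 10 percentage points, or roughly 15% in relative terms, while GPT-3 is about 16 times larger.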
For Macaw alone, the team demonstrated its abilities in reasoning, common sense, comparison, false presuppositions, meta reasoning, hypotheticals and entity substitution through examples. For instance, it gave correct answers when asked different questions that sound similar.
Q: If my house did not have a roof, and it rained, what would happen to my bed upstairs?
A: it would get wet
Q: If my house had a roof, and it rained, what would happen to my bed upstairs?
A: it would stay dry
On the other hand, the model is weak at entity tracking and struggles with stories about a changing world state. For instance:
Q: My house is red. I painted my house white. What colour is my house now?
A: it is now white [correct]
Q: My house is blue. I painted my house red. What colour was my house before I painted it?
A: blue [correct]
Q: My house is red. My neighbour’s house is blue. I painted my house white. What colour is my house now?
A: it’s the same colour as my neighbour’s house [wrong]
Q: My house is red. I painted my neighbour’s house white. What colour is my house now?
A: half red, half white [wrong]
Most models scored highest on general knowledge questions and lowest on common sense and trick questions. The group found that GPT-3 and T5-CBQA produced impressive answers to some questions but made more mistakes than Macaw. One example discussed is “When did Tom Hanks land on the moon?” GPT-3 said 1995, based on the film Apollo 13, while Macaw noted that, as far as we know, Tom Hanks has never been to the moon.
Dr Oren Etzioni, Chief Executive Officer at AI2, has noted that Macaw is not meant to replace GPT-3 but is a new step in NLP research. GPT-3 is a remarkable NLP model, but it is out of reach for many organisations given its massive size and cost. Macaw, on the other hand, is geared towards building AI systems that can read, reason, and explain their answers.
In an interview with TechCrunch, Etzioni called GPT-3 amazing but pointed out that it came out only 18 months ago and that access to it is limited. While it has remarkable capabilities, he argued, you can do more with less. “Sometimes you have to build something with 175 billion parameters to say, well, maybe we can do this with 10 billion,” he said.
The cost of the GPT-3 dream
Large-scale models will be useful, but smaller models have a better chance of being deployed in day-to-day cases. In AIM’s recent council post, Padmashree Shagrithaya, the Global Head of Analytics and Data Science at Capgemini, discussed the impact of such large NLP models and illustrated the environmental cost of GPT-3 through examples. “An AI language-processing system generates anywhere between 1,400 to 78,000 pounds of emission. This is equivalent to 125 round trip flights between New York and Beijing”. Additionally, “Carbontracker suggested training GPT-3 just once requires the same amount of power used by 126 homes in Denmark every year. It is also the same as driving a car to the moon and back.”
“While innovation is the basis on which a society moves forward, we must also be conscious of the cost such ‘innovation’ brings. The need of the hour is to strike a balance between the two,” she concluded. A smaller but equally effective model like Macaw could possibly help create this balance.