While GPT-3 has drawn attention for achieving state-of-the-art performance on complex NLP tasks with 175 billion parameters, researchers from LMU Munich, Germany, have proposed an approach that achieves similar results with far fewer parameters.
GPT-3 has 175 billion parameters and shows remarkable few-shot abilities; by reformulating tasks and priming the model with a few examples, it also performs strongly on the SuperGLUE benchmark. However, this approach has two significant drawbacks: models of this size aren't always feasible in real-world scenarios, and because the context window of these models is limited to a few hundred tokens, priming doesn't scale beyond a few examples.
The researchers therefore proposed an alternative to priming, Pattern Exploiting Training (PET), which combines the idea of reformulating tasks as cloze questions with regular gradient-based fine-tuning. PET additionally requires unlabelled data, which is easier to gather than labelled data, making it usable for real-world applications. Its most significant limitation is that it only works when the answer to be predicted corresponds to a single token in the model's vocabulary, which is a problem for many NLP tasks.
Pattern Exploiting Training (PET)
In this study, the researchers modified PET to predict more than one token, allowing it to outperform GPT-3 on SuperGLUE with 32 training examples and only 0.1% of GPT-3's parameters. They also showed how PET leverages masked language models to assign probabilities to sequences of text.
To map inputs to outputs, PET requires pattern-verbaliser pairs (PVPs). Each PVP consists of a pattern, which maps an input to a cloze question containing a single mask, and a verbaliser, which maps each output to a single token representing its task-specific meaning in the pattern.
Application of a pattern-verbaliser pair for recognising textual entailment: an input is converted into a cloze question, and the probability of each output is derived from the probability of its verbalisation being a plausible choice for the masked position.
PET derives the probability of an output being correct from the probability of its verbalisation being the correct token at the masked position. Identifying PVPs that perform well for a given task is challenging in the absence of a large development set, which is why PET enables a combination of multiple PVPs.
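As a minimal sketch of what a pattern-verbaliser pair could look like for textual entailment, consider the following. The class name, the pattern wording, the label names and the `[MASK]` placeholder are illustrative assumptions, not the researchers' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumption: a generic mask placeholder; the real token depends on the model's tokenizer.
MASK = "[MASK]"


@dataclass
class PatternVerbaliserPair:
    """Hypothetical container for one PVP: a pattern plus a verbaliser."""
    pattern: Callable[[str, str], str]   # maps an input to a cloze question
    verbaliser: Dict[str, str]           # maps each output label to a single token

    def cloze(self, premise: str, hypothesis: str) -> str:
        return self.pattern(premise, hypothesis)


def rte_pattern(premise: str, hypothesis: str) -> str:
    # Illustrative wording only: one mask sits between hypothesis and premise.
    return f'"{hypothesis}"? {MASK}, "{premise}"'


rte_pvp = PatternVerbaliserPair(
    pattern=rte_pattern,
    verbaliser={"entailment": "Yes", "not_entailment": "No"},
)

question = rte_pvp.cloze("Oil prices fall back.", "Oil prices rise.")
print(question)
```

A masked language model would then be asked which verbaliser token ("Yes" or "No") fits best at the masked position.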
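The step from token scores to output probabilities can be sketched as a softmax restricted to the verbaliser tokens. The function name and the toy scores below are assumptions for illustration; in practice the scores would come from a masked language model.

```python
import math
from typing import Dict


def output_probabilities(mask_scores: Dict[str, float],
                         verbaliser: Dict[str, str]) -> Dict[str, float]:
    """Softmax over the scores of the verbaliser tokens at the masked position."""
    scores = {label: mask_scores[token] for label, token in verbaliser.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy scores for the masked position (assumed values, not model output).
mask_scores = {"Yes": 2.0, "No": 0.0, "Maybe": 1.0}
verbaliser = {"entailment": "Yes", "not_entailment": "No"}

probs = output_probabilities(mask_scores, verbaliser)
print(probs)
```

Tokens that do not verbalise any output, such as "Maybe" here, are ignored, so the probabilities over the outputs sum to one.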
For each PVP, a masked language model (MLM) is fine-tuned on the training examples, and the ensemble of fine-tuned MLMs is then used to annotate a set of unlabelled data with soft labels based on the probability distribution over outputs. The soft-labelled dataset is then used to train a regular sequence classifier.
The researchers noted that PET is limited by its verbaliser: for many tasks, it is difficult to map each possible output to a single token. They therefore generalised verbalisers to functions that can map an output to a sequence of tokens, which required some modifications to inference and training.
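The soft-labelling step can be sketched as follows. This is a simplified illustration that assumes a plain unweighted average over the models in the ensemble; the exact combination used by the researchers may differ, and all names and numbers here are hypothetical.

```python
import math
from typing import Dict, List

Distribution = Dict[str, float]


def soft_label(predictions: List[Distribution]) -> Distribution:
    """Average the output distributions predicted by the per-PVP models."""
    labels = predictions[0].keys()
    n = len(predictions)
    return {label: sum(p[label] for p in predictions) / n for label in labels}


def soft_cross_entropy(target: Distribution, predicted: Distribution) -> float:
    """Loss a final classifier could minimise against a soft label."""
    return -sum(target[label] * math.log(predicted[label]) for label in target)


# Toy predictions of three fine-tuned models for one unlabelled example.
per_pvp_predictions = [
    {"entailment": 0.90, "not_entailment": 0.10},
    {"entailment": 0.60, "not_entailment": 0.40},
    {"entailment": 0.75, "not_entailment": 0.25},
]

target = soft_label(per_pvp_predictions)
uniform_guess = {"entailment": 0.5, "not_entailment": 0.5}
print(target, soft_cross_entropy(target, uniform_guess))
```

Training the final classifier on such soft labels lets it absorb the combined knowledge of all PVPs without having to pick a single best pattern in advance.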
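To illustrate the idea of a multi-token verbaliser, the sketch below scores each output by multiplying the probabilities of its tokens at consecutive mask positions. This is a deliberate simplification: it treats the mask positions as independent and assumes all outputs have the same number of tokens, and it is not the inference procedure from the paper. All names and probabilities are hypothetical.

```python
import math
from typing import Dict, List


def sequence_log_prob(position_probs: List[Dict[str, float]],
                      tokens: List[str]) -> float:
    """Sum of log-probabilities of each token at its own mask position."""
    if len(position_probs) != len(tokens):
        raise ValueError("need exactly one mask position per verbaliser token")
    return sum(math.log(dist[tok]) for dist, tok in zip(position_probs, tokens))


# Verbaliser generalised to map each output to a sequence of tokens.
multi_token_verbaliser = {
    "positive": ["very", "good"],
    "negative": ["very", "bad"],
}

# Toy per-position token probabilities for a pattern with two masks.
position_probs = [
    {"very": 0.9, "quite": 0.1},
    {"good": 0.7, "bad": 0.2, "odd": 0.1},
]

scores = {
    output: sequence_log_prob(position_probs, tokens)
    for output, tokens in multi_token_verbaliser.items()
}
best_output = max(scores, key=scores.get)
print(best_output, scores)
```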
PET vs GPT-3 on SuperGLUE
To compare the performance of GPT-3 and PET, the researchers chose SuperGLUE as a benchmark. They noted that PET cannot be evaluated on the exact same training data as GPT-3, because, for most tasks, GPT-3 uses a different set of training examples for each test example. To level the playing field, the researchers created a new training set of 32 examples for each task, randomly selected using a fixed random seed.
In addition, the researchers created a set of up to 20,000 unlabelled examples for each task by removing all labels from the original training sets. They refer to the resulting collection of training examples and unlabelled examples as FewGLUE.
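The construction of such a few-shot split can be sketched in a few lines. The function name, the seed value, the example format and the choice to take the unlabelled examples from the start of the training set are all assumptions made for illustration; the article does not specify them.

```python
import random
from typing import Dict, List, Tuple

Example = Dict[str, str]


def make_few_shot_split(train: List[Example],
                        num_labelled: int = 32,
                        num_unlabelled: int = 20_000,
                        seed: int = 42) -> Tuple[List[Example], List[Example]]:
    """Sample a small labelled set with a fixed seed and strip labels from the rest."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    labelled = rng.sample(train, min(num_labelled, len(train)))
    unlabelled = [
        {key: value for key, value in example.items() if key != "label"}
        for example in train[:num_unlabelled]
    ]
    return labelled, unlabelled


# Toy training set standing in for one SuperGLUE task.
toy_train = [
    {"text": f"example {i}", "label": "yes" if i % 2 else "no"}
    for i in range(100)
]

labelled, unlabelled = make_few_shot_split(toy_train)
print(len(labelled), len(unlabelled))
```

Because the random generator is seeded, running the function again on the same data yields the same 32 examples, which is what makes comparisons across experiments fair.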
The SuperGLUE tasks include BoolQ, a QA task; CB and RTE, textual entailment tasks; COPA, a causal reasoning task; MultiRC, another QA task; ReCoRD, a cloze-style task; WiC, a word sense disambiguation task; and WSC, a coreference resolution task. As the underlying model for PET, the researchers opted for ALBERT.
PET is then run on the FewGLUE training sets for all SuperGLUE tasks. For COPA, WSC and ReCoRD, the researchers used their proposed multi-token modification of PET; for all other tasks, they used regular PET. The iterative variant, iPET, is trained on all tasks except COPA, WSC and ReCoRD, for which the results of regular PET are simply reused.
The results show that ALBERT with PET performs similarly to the largest GPT-3 model, which is larger by a factor of 785. On average, PET performs 18 points better than GPT-3 Med, a GPT-3 model of similar size. Breaking down the results by task, PET, like GPT-3, does not perform above random chance on WiC, and ReCoRD is the only task on which GPT-3 consistently performs better than PET.
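As a quick sanity check on the size comparison, the factor of 785 and the figure of roughly 0.1% of GPT-3's parameters can be related with simple arithmetic. The calculation below only uses the 175 billion parameter count and the factor quoted in the article; the implied model size is derived from them, not taken from the paper.

```python
GPT3_PARAMS = 175_000_000_000   # parameter count quoted for GPT-3
SIZE_FACTOR = 785               # how many times larger GPT-3 is, per the article

implied_params = GPT3_PARAMS / SIZE_FACTOR   # implied size of the smaller model
share_percent = 100 / SIZE_FACTOR            # smaller model as a share of GPT-3

print(f"{implied_params / 1e6:.0f} million parameters, "
      f"{share_percent:.2f}% of GPT-3")
# prints "223 million parameters, 0.13% of GPT-3"
```

The result, about 0.13%, is consistent with the rounded figure of 0.1% mentioned earlier.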
With this study, the researchers showed that few-shot performance on NLP tasks similar to that of GPT-3 can be achieved using PET. PET reformulates tasks as cloze questions and trains a model for each reformulation. To make this possible, the researchers modified PET so that it can be used for tasks that require multiple tokens.
Although the results show that the proposed method outperforms GPT-3 on many tasks, it did not achieve similar results on every task. Nevertheless, the study opens up opportunities for pushing the boundaries of AI with modest hardware.