NLP models have made tremendous advances in capturing syntactic, semantic and linguistic knowledge for downstream tasks. This raises an interesting research question: can they go beyond pattern recognition and apply common sense, for instance in word-sense disambiguation?
To find out whether BERT, the large pre-trained NLP model developed by Google, can solve common sense tasks, researchers from Westlake University and Fudan University, in collaboration with Microsoft Research Asia, took a closer look. They studied how the model encodes structured common sense knowledge and uses it for downstream NLP tasks.
According to the researchers, it has been a long-standing debate whether pre-trained language models solve such tasks using genuine common sense knowledge or only a few shallow clues. To find out, the researchers had BERT solve multiple-choice problems from the CommonsenseQA dataset.
In this research, the analysts used CONCEPTNET relations and focused on attention heads to measure the common sense knowledge encoded in BERT.
BERT Model Used For CommonSenseQA
To facilitate the process, the researchers chose CommonSenseQA, a multiple-choice question answering dataset built on the CONCEPTNET knowledge graph. The dataset comprises a broad set of triples of the form (source concept, relation, target concept), for example the source concept 'bird' paired with the relation type 'at location.'
Case in point: given one question and five candidate answers (as shown in the figure above), the model is asked to select the correct one. Conventionally, NLP models solve this by scoring each answer based on a sentence-level hidden vector and returning the one with the highest score.
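For concreteness, here is a minimal sketch of what such an item might look like; the question text, field names and answer choices are invented for illustration and are not taken from the dataset itself.

```python
# A hypothetical CommonsenseQA-style item, shaped after the description above.
# Question text, field names and choices are invented for illustration.
example = {
    "question": "Where would you most likely see a bird resting?",
    "question_concept": "bird",   # source concept of the CONCEPTNET triple
    "relation": "AtLocation",     # relation type, e.g. ('bird', 'AtLocation', ...)
    "choices": ["countryside", "cage", "windowsill", "roof", "sky"],
    "answer_key": 0,              # index of the annotated correct choice
}

# A conventional model scores each (question, choice) pair and
# returns the index of the highest-scoring candidate.
def pick_answer(scores):
    return max(range(len(scores)), key=scores.__getitem__)

print(pick_answer([0.1, 0.7, 0.2, 0.05, 0.3]))  # -> 1
```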
However, to examine the presence of common sense in BERT, the researchers looked at the common sense link between the question and the answer, which is manually annotated in the provided dataset.
The researchers termed the source concept the 'question concept' and the target concept the 'answer concept.' Each question [q] comes with five answers [a1 … a5]; the researchers concatenated the question with each answer to obtain five sequences [s1 … s5], respectively.
BERT Architecture.
Further, the researchers added special symbols to each sequence: [CLS] at the beginning, and [SEP] between the question and the answer and at the end. BERT uses stacked Transformer layers to encode each sequence; the last-layer representation of the [CLS] token is then fed to a linear classifier, and the answer with the highest score is chosen as the output.
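As a rough illustration of this setup, the sketch below scores five candidate answers with a linear layer on top of BERT's [CLS] vector, using the Hugging Face transformers library. The question, the choices and the untrained classifier are placeholders; in the paper's fine-tuned model (BERT-FT) this classifier is trained on CommonsenseQA.

```python
import torch
from transformers import BertModel, BertTokenizer

# Off-the-shelf BERT plus an (untrained) linear scorer over the [CLS] vector.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Linear(bert.config.hidden_size, 1)

question = "Where would you most likely see a bird resting?"  # invented example
choices = ["countryside", "cage", "windowsill", "roof", "sky"]

scores = []
with torch.no_grad():
    for choice in choices:
        # The tokenizer builds "[CLS] question [SEP] choice [SEP]" automatically.
        inputs = tokenizer(question, choice, return_tensors="pt")
        cls_vector = bert(**inputs).last_hidden_state[:, 0]  # last-layer [CLS] vector
        scores.append(scorer(cls_vector).item())

print(choices[scores.index(max(scores))])  # answer with the highest score
```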
Does BERT Contain Structured Commonsense Knowledge?
To facilitate the analysis, the researchers assessed the common sense links using attention weights and their respective attribution scores. Attention weights play a significant role in producing the next layer's representation, but on their own they are insufficient to characterise the behaviour of an attention head, since they disregard the values of the hidden vectors. The researchers therefore supplemented them with attribution scores, which use backpropagation to interpret the contribution of each input. Together, the two values gave the researchers a better understanding of the common sense links in BERT.
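The sketch below is one simple reading of these two measures, not the paper's exact implementation: it pulls per-head attention weights from an off-the-shelf bert-base-uncased model and multiplies them by their gradients with respect to a stand-in scalar score to obtain gradient-based attribution scores. The example sentence, the toy score and the choice of answer-token position are all assumptions made for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

question, answer = "Where would you most likely see a bird resting?", "countryside"
inputs = tokenizer(question, answer, return_tensors="pt")
outputs = model(**inputs)

# One attention tensor per layer, each of shape (batch, heads, seq_len, seq_len).
attentions = outputs.attentions

# Stand-in scalar "answer score" used only to drive backpropagation here;
# the paper derives gradients from the model's actual answer score.
score = outputs.last_hidden_state[:, 0].mean()
grads = torch.autograd.grad(score, attentions)

answer_pos = inputs["input_ids"].shape[1] - 2  # token before the final [SEP] (toy choice)
per_head_links = []
for attn, grad in zip(attentions, grads):
    link_weight = attn[0, :, answer_pos, :]           # attention from the answer token, per head
    attribution = (attn * grad)[0, :, answer_pos, :]  # gradient-weighted attribution, per head
    per_head_links.append((link_weight, attribution))
```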
First, the researchers conducted a set of experiments to figure out whether BERT can actually capture common sense knowledge. According to the paper, this is judged by whether the link weight from the answer concept to the question concept is higher than the link weights from the answer concept to the other words of the question.
Second, to evaluate the link weights, the researchers identified, for each attention head in each layer, the most associated word, i.e. the question word receiving the maximum link weight from the answer concept. They then measured the average accuracy across all attention heads as well as the accuracy of the most accurate head.
The figure shows the average and maximum accuracy of the most associated word of BERT for different common-sense relations.
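A hypothetical sketch of this evaluation is given below, assuming per-example link-weight tensors have already been extracted as in the previous snippet; the function name and input format are illustrative rather than the paper's code.

```python
import torch

def most_associated_word_accuracy(link_weights, question_positions, concept_positions):
    """For each attention head, treat the question word that receives the highest
    link weight from the answer concept as the 'most associated word', and count
    it correct when that word belongs to the question concept.

    link_weights: per-example tensors of shape (layers, heads, seq_len).
    question_positions: per-example 1-D index tensors of the question words.
    concept_positions: per-example sets of question-concept token indices.
    """
    n_layers, n_heads = link_weights[0].shape[:2]
    correct = torch.zeros(n_layers, n_heads)
    for weights, q_pos, c_pos in zip(link_weights, question_positions, concept_positions):
        restricted = weights[:, :, q_pos]        # only consider question words
        best = q_pos[restricted.argmax(dim=-1)]  # most associated word per head
        correct += torch.tensor([[float(best[l, h].item() in c_pos) for h in range(n_heads)]
                                 for l in range(n_layers)])
    accuracy = correct / len(link_weights)
    return accuracy.mean().item(), accuracy.max().item()  # average and maximum head accuracy
```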
In conclusion, if the accuracy of the most associated words significantly outstrips the random baseline, it indicates that the relevant question concept plays a significant role in BERT's encoding even without fine-tuning. And when the fine-tuned model, BERT-FT, surpasses BERT, it demonstrates that supervised learning can further enhance the common sense knowledge in BERT.
Wrapping Up
The researchers carried out qualitative and quantitative analyses to understand how the large NLP model, BERT, solves CommonSenseQA tasks. The results indicate that BERT indeed encodes structured commonsense knowledge and is able to use it, to a certain degree, for downstream NLP tasks. The researchers further noted that fine-tuning BERT can enhance this knowledge in its higher layers. With the release of the paper, they aim to encourage further work on leveraging BERT's underlying mechanisms for real-world innovations.
Read the whole paper here.