While it has been established that pre-training large language models such as Google’s BERT or XLNet brings immense advantages in NLP tasks, these models are usually trained on a general collection of texts such as websites, documents, books and news. Experts believe, however, that pre-training a model on domain-specific knowledge can provide substantial gains over one trained on general or mixed-domain knowledge.
To facilitate this investigation, Microsoft researchers compiled a comprehensive biomedical NLP benchmark from publicly available datasets and used it to compare modelling choices for pre-training and their impact on a domain-specific application, in this case biomedicine. The researchers found that domain-specific pre-training from scratch can be enormously beneficial across a wide range of specialised NLP tasks. Also, to accelerate research in biomedical natural language processing, Microsoft has released its pre-trained and task-specific models to the community.
Domain-Specific Pre-Training vs Mixed-Domain Pre-Training
NLP models trained on general texts have proved beneficial; however, NLP models also have high-value applications in specialised domains like biomedicine, finance and law, each of which offers an abundance of domain-specific text to train on. Until now, none of the biomedical BERT variants has been pre-trained purely on domain-specific texts; instead, they rely on mixed-domain pre-training, starting from a general-domain model and continuing training on biomedical text.
According to the researchers, mixed-domain pre-training is beneficial only when the target domain has little text of its own. For a specialised domain like biomedicine, which has 30 million papers in PubMed, the better strategy is domain-specific pre-training with an in-domain vocabulary. Domain-specific training also lets the model specialise in a single field, unlike mixed-domain models, which must balance several kinds of knowledge to perform one task. Continually training a general model on domain knowledge can narrow this gap, but not close it completely.
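One concrete reason an in-domain vocabulary matters is tokenisation: a general-domain WordPiece vocabulary tends to shred biomedical terms into fragments, while an in-domain vocabulary keeps them intact. The sketch below uses a toy greedy longest-match WordPiece tokeniser and two toy vocabularies (neither is the actual BERT or PubMedBERT vocabulary) purely to illustrate the effect:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenisation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Continuation pieces are prefixed with "##", as in BERT.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies for illustration only.
general_vocab = {"lymph", "##oma", "##ph", "##o", "##ma", "ly", "##m"}
biomedical_vocab = {"lymphoma"}

print(wordpiece("lymphoma", general_vocab))     # split into sub-word fragments
print(wordpiece("lymphoma", biomedical_vocab))  # kept as a single in-domain token
```

Because the in-domain model sees "lymphoma" as one token rather than several fragments, it can learn a single, dedicated representation for the concept.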
To facilitate the investigations around pre-training of biomedical NLP models, the researchers created a new benchmark: BLURB, short for Biomedical Language Understanding & Reasoning Benchmark. The benchmark consists of biomedical NLP tasks drawn from publicly available datasets, covering relation extraction, sentence similarity, question answering and classification tasks.
To compute the BLURB score, the datasets are first grouped by task type; the researchers then average the scores within each task and report the mean of those task-level averages as the overall benchmark score.
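This two-level averaging is straightforward to sketch. The per-dataset scores below are made-up numbers (not actual leaderboard results), and the grouping is illustrative; only the averaging scheme reflects the description above:

```python
# Hypothetical per-dataset scores, grouped by task type.
scores_by_task = {
    "relation_extraction":  {"dataset_a": 0.77, "dataset_b": 0.82},
    "sentence_similarity":  {"dataset_c": 0.92},
    "question_answering":   {"dataset_d": 0.55, "dataset_e": 0.87},
    "classification":       {"dataset_f": 0.85},
}

# Step 1: average scores within each task type.
task_averages = {
    task: sum(datasets.values()) / len(datasets)
    for task, datasets in scores_by_task.items()
}

# Step 2: the overall BLURB score is the mean of the task-level averages,
# so each task type counts equally regardless of how many datasets it has.
blurb_score = sum(task_averages.values()) / len(task_averages)
print(round(blurb_score, 4))
```

Averaging at the task level first prevents task types with many datasets from dominating the final score.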
Source: BLURB Leaderboard, Microsoft — https://microsoft.github.io/BLURB/leaderboard.html
Further, as explained in the paper, given an input token sequence, the language model produces a contextual vector for each token. A task-specific model is then layered on top of these vectors to produce the final output for the task-specific application. Given task-specific training data, the model parameters, including those of BERT itself, are refined using backpropagation.
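This encoder-plus-head arrangement can be sketched schematically. The snippet below is a pure-Python stand-in, not the paper's implementation: random vectors stand in for the encoder's contextual outputs, and a linear-plus-softmax head reads the [CLS] position to produce class probabilities. The fine-tuning step itself (backpropagation through head and encoder) is not shown:

```python
import math
import random

random.seed(0)
HIDDEN = 8        # toy hidden size; BERT-base actually uses 768
NUM_CLASSES = 2   # e.g. a binary classification task

# Stand-in for the encoder output: one contextual vector per input token.
tokens = ["[CLS]", "aspirin", "inhibits", "cox", "[SEP]"]
context_vectors = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in tokens]

# Task-specific head: a linear layer followed by a softmax.
W = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

def classify(cls_vector):
    """Map the [CLS] contextual vector to class probabilities."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    shift = max(logits)                      # numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = classify(context_vectors[0])         # the head reads the [CLS] position
```

During fine-tuning, the gradient of the task loss flows through this head into the encoder, so both sets of parameters adapt to the task.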
Furthermore, to evaluate the impact of domain-specific pre-training on biomedical NLP, the researchers generated a corpus of 14 million PubMed abstracts containing 3.2 billion words (21 GB). Following a standard pre-training procedure based on a TensorFlow implementation, training took about five days for 62,500 steps with a batch size of 8,192 on DGX-2 machines with 16 V100 GPUs.
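A quick back-of-envelope calculation on the reported numbers, assuming each step consumes one full batch of sequences, gives a sense of the training scale:

```python
# Reported pre-training configuration.
steps = 62_500
batch_size = 8_192
corpus_words = 3_200_000_000  # 3.2 billion words

# Total training sequences processed, assuming one full batch per step.
sequences_processed = steps * batch_size
print(sequences_processed)  # 512,000,000 sequences over ~5 days
```

At this scale the model sees roughly half a billion sequences drawn from the 14 million abstracts over the five-day run.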
For comparison, the researchers evaluated BERT, Facebook’s RoBERTa, SciBERT, ClinicalBERT and BlueBERT, and found that domain-specific pre-training from scratch outperforms all of these BERT variants on biomedical NLP tasks. Notably, it not only exceeds the BERT-based models but also surpasses RoBERTa, which has the most extensive pre-training corpus of the group.
By challenging the prevailing assumption that language models should be trained on out-of-domain knowledge, the researchers showed that domain-specific training can significantly outperform both mixed-domain training and continual pre-training, which in turn enables an extensive range of specialised NLP applications and tasks. Going forward, the researchers aim to explore further domain-specific training strategies, incorporating more specialised NLP tasks and extending BLURB to other high-value domains.
Read the whole paper here.