How Domain-Specific Pre-Training Can Outstrip General Language Models In Biomedical NLP Tasks

Published on August 6, 2020

by Sejuti Das

While it is well established that pre-training large language models such as Google’s BERT or XLNet brings immense advantages in NLP tasks, these models are usually trained on a general collection of texts such as websites, documents, books and news. Experts believe, however, that pre-training on domain-specific text can provide substantial gains over a model trained on general or mixed-domain text.

To investigate this, Microsoft researchers compiled a comprehensive biomedical NLP benchmark from publicly available datasets and used it to compare modelling choices for pre-training and their impact on domain-specific applications, with biomedicine as the case in point. They found that domain-specific pre-training from scratch can be enormously beneficial across a wide range of specialised NLP tasks. To accelerate research in biomedical natural language processing, Microsoft has also released its pre-trained, task-specific models for the community.

Domain-Specific Pre-Training vs Mixed-Domain Pre-Training

NLP models trained on general text have proved beneficial; however, NLP also has high-value applications in specialised domains such as biomedicine, finance and law, which have an abundance of domain-specific text to train models on. Until now, none of the biomedical BERT models had been trained purely on domain-specific text; instead, researchers relied on mixed-domain pre-training of the language models.

According to the researchers, mixed-domain pre-training is beneficial only if the target domain has little text of its own. For a specialised domain like biomedicine, which has 30 million papers in PubMed, they argue that domain-specific pre-training with an in-domain vocabulary is the better strategy. Domain-specific training also lets the model specialise in that field, whereas mixed-domain models must balance several types of knowledge to perform a single task. Continually training a general model on domain text can narrow this gap, but not close it completely.
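To illustrate why an in-domain vocabulary matters, here is a minimal sketch of greedy longest-match subword tokenization in the style of WordPiece. The two vocabularies are tiny hypothetical examples invented for this illustration; they are not the actual vocabularies of BERT or of the Microsoft models, and the function name `wordpiece` is likewise an assumption.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
general_vocab = {"ace", "##ty", "##l", "##trans", "##fer", "##ase"}
biomedical_vocab = general_vocab | {"acetyltransferase"}

print(wordpiece("acetyltransferase", general_vocab))
print(wordpiece("acetyltransferase", biomedical_vocab))
```

With the toy general vocabulary the term is split into several fragments, while the toy in-domain vocabulary keeps it as a single token, so the model can learn one representation for the term rather than piecing it together from fragments.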

To facilitate investigation into the pre-training of biomedical NLP models, the researchers created a new benchmark called BLURB, short for Biomedical Language Understanding & Reasoning Benchmark. It consists of biomedical NLP tasks drawn from publicly available datasets, including relation extraction, sentence similarity, question answering and classification.

The datasets are first grouped by task type. To arrive at the overall BLURB result, the researchers then combine the average scores of the task types.
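The two-level averaging described above can be sketched as follows. This assumes the overall score is a plain mean of the per-task-type means; the task-type names, dataset names and scores are hypothetical placeholders, not real BLURB results, and `blurb_style_score` is a name made up for this sketch.

```python
from statistics import mean


def blurb_style_score(scores_by_task_type):
    """Average within each task type, then average those averages."""
    per_type = {
        task_type: mean(list(dataset_scores.values()))
        for task_type, dataset_scores in scores_by_task_type.items()
    }
    overall = mean(list(per_type.values()))
    return per_type, overall


# Hypothetical scores, for illustration only.
example = {
    "task_type_a": {"dataset_1": 80.0, "dataset_2": 90.0},
    "task_type_b": {"dataset_3": 70.0},
}

per_type, overall = blurb_style_score(example)
print(per_type)   # per-task-type averages
print(overall)    # average of the per-type averages
```

Grouping first means a task type with many datasets counts once in the overall score. In the toy data a flat average over the three datasets would give 80.0, whereas the grouped average gives 77.5.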

Source: BLURB Leaderboard, Microsoft — https://microsoft.github.io/BLURB/leaderboard.html

Further, as explained in the paper, given an input token sequence, the language model produces a sequence of vectors that represent each token in context. A task-specific model is then layered on top to produce the final output for the application. Using task-specific training data, researchers can learn the parameters of the task-specific model and refine the BERT parameters through backpropagation.
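As a rough illustration of the task-specific layer, the sketch below trains a small softmax classifier on top of fixed vectors using gradient descent. It is a simplification under stated assumptions: the three-dimensional vectors are hypothetical stand-ins for the encoder output, and only the classification layer is updated here, whereas the fine-tuning described above also refines the BERT parameters. All function names are invented for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_forward(vector, weights, bias):
    """Linear layer followed by softmax: the task-specific 'head'."""
    logits = [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def train_step(vector, label, weights, bias, lr=0.5):
    """One gradient step on the cross-entropy loss; returns the loss."""
    probs = head_forward(vector, weights, bias)
    loss = -math.log(probs[label])
    for k, row in enumerate(weights):
        # Gradient of cross-entropy w.r.t. logit k is (p_k - 1[k == label]).
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, x in enumerate(vector):
            row[j] -= lr * grad * x
        bias[k] -= lr * grad
    return loss


# Hypothetical "contextual vectors" with class labels (stand-ins for encoder output).
examples = [([1.0, 0.0, 0.5], 0), ([0.0, 1.0, 0.5], 1)]
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
bias = [0.0, 0.0]

losses = []
for epoch in range(20):
    losses.append(sum(train_step(v, y, weights, bias) for v, y in examples))

print(losses[0], losses[-1])  # the summed loss should fall over training
```

In full fine-tuning, the same loss gradient would continue to flow back through the encoder so that its parameters are adjusted as well.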

Furthermore, to evaluate the impact of domain-specific pre-training on biomedical NLP, the researchers assembled a collection of 14 million PubMed abstracts containing 3.2 billion words, or 21 GB of text. Following a standard pre-training procedure based on the TensorFlow implementation, training took about five days for 62,500 steps with a batch size of 8,192 on DGX-2 machines with 16 V100 GPUs.
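A quick back-of-the-envelope calculation puts these figures in perspective. It uses only the numbers quoted above and assumes the batch size counts training sequences per step and that "about five days" means exactly five; the derived values are therefore rough estimates, not figures reported by the researchers.

```python
steps = 62_500
batch_size = 8_192          # assumed to be sequences per step
days = 5                    # "about five days", taken as exactly five

abstracts = 14_000_000
words = 3_200_000_000

# Total sequence presentations during pre-training (with repetition).
sequences_seen = steps * batch_size

# Average wall-clock time per optimisation step.
seconds_per_step = days * 24 * 60 * 60 / steps

# Average length of a PubMed abstract in the corpus.
words_per_abstract = words / abstracts

print(f"{sequences_seen:,} sequences processed")
print(f"{seconds_per_step:.2f} seconds per step")
print(f"{words_per_abstract:.0f} words per abstract on average")
```

Under these assumptions the run processes 512 million sequences at just under seven seconds per step, over a corpus averaging roughly 229 words per abstract.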

For comparison, the researchers used BERT, Facebook’s RoBERTa, SciBERT, ClinicalBERT and BlueBERT, and found that domain-specific pre-training from scratch outperformed all of these models on biomedical NLP tasks. Notably, it exceeded not only the BERT models but also RoBERTa, which has the most extensive pre-training corpus.

Wrapping Up

By challenging the prevailing assumption that language models benefit from out-of-domain text, the researchers showed that domain-specific pre-training can significantly outperform both mixed-domain pre-training and continual pre-training, which in turn should benefit an extensive range of specialised NLP applications and tasks. Going forward, the researchers aim to explore more domain-specific pre-training strategies, incorporate more specialised NLP tasks and extend BLURB to other high-value domains.

Read the whole paper here.
