SciBERT is a pre-trained, BERT-based language model for Natural Language Processing (NLP) tasks on scientific text. It was introduced by Iz Beltagy, Kyle Lo and Arman Cohan, researchers at the Allen Institute for Artificial Intelligence (AllenAI), in September 2019 (research paper).
Since SciBERT’s architecture is based on the BERT (Bidirectional Encoder Representations from Transformers) model, read the BERT research paper first if you are unfamiliar with the base model.
Overview of SciBERT
Training deep neural networks for NLP tasks requires a huge amount of labelled data. Although gathering that much data is feasible in many general domains, it is difficult in scientific domains because annotating scientific text requires expertise. SciBERT, a model trained on a large corpus of scientific text, leverages unsupervised pre-training and significantly improves on the performance of the BERT model in scientific NLP tasks.
Background of SciBERT
The base model BERT is trained on two tasks:
- Predict randomly masked tokens
- Predict whether two sentences follow each other
SciBERT follows the same model architecture as BERT; the only difference is that it is trained on scientific text instead.
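To make the first pre-training task concrete, here is a minimal sketch of how masked-token training examples can be built. It is a simplification: the helper name `mask_tokens`, the sample sentence and the masking probability are illustrative assumptions, and BERT's actual procedure selects 15% of tokens and does not always substitute `[MASK]` for a selected token.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a simplified masked-token task.

    targets[i] holds the original token where a mask was applied, else None.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            targets.append(token)      # the model must predict this token
        else:
            masked.append(token)
            targets.append(None)       # nothing to predict at this position
    return masked, targets

# Hypothetical sentence; a higher mask_prob is used so that a short
# sentence is likely to contain at least one mask.
sentence = "the patient was diagnosed with ataxia telangiectasia".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model sees `masked` as input and is trained to recover the non-`None` entries of `targets`.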
Vocabulary used by SciBERT
The original WordPiece vocabulary used by the BERT model is referred to as BASEVOCAB in the context of SciBERT. SciBERT also has its own WordPiece vocabulary, called SCIVOCAB, which is built from a scientific corpus using the SentencePiece library.
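The practical effect of a domain vocabulary can be sketched with the greedy longest-match-first procedure that WordPiece tokenizers use to split a word. The two vocabularies below are toy sets invented for illustration; they are not excerpts from BASEVOCAB or SCIVOCAB, and the sketch shows how a vocabulary is applied, not how it is built.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies, invented for illustration only.
general_vocab = {"im", "##mun", "##og", "##lo", "##bul", "##in"}
science_vocab = {"immuno", "##globulin"}

print(wordpiece_tokenize("immunoglobulin", general_vocab))
print(wordpiece_tokenize("immunoglobulin", science_vocab))
```

With the toy general-purpose vocabulary the term breaks into six short pieces, while the toy scientific vocabulary covers it in two, which is the kind of difference a corpus-specific vocabulary is meant to produce.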
Core NLP tasks SciBERT can accomplish
- Named Entity Recognition (NER)
- PICO Extraction
- Text Classification (CLS)
- Relation Classification (REL)
- Dependency Parsing (DEP)
Variants of SciBERT
There are four versions of the SciBERT model, one for each combination of:
(i) cased or uncased
(ii) BASEVOCAB or SCIVOCAB
The two models using BASEVOCAB are fine-tuned from the corresponding BERT-base models. The other two models, which use SCIVOCAB, are trained from scratch.
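The four combinations can be enumerated in a few lines. The naming pattern below is an assumption extrapolated from the identifier 'scibert-basevocab-cased' used later in this walkthrough; check the library you use for the exact names it accepts.

```python
from itertools import product

vocabularies = ["basevocab", "scivocab"]
casings = ["cased", "uncased"]

# Assumed naming pattern: scibert-<vocabulary>-<casing>
variants = [f"scibert-{vocab}-{casing}"
            for vocab, casing in product(vocabularies, casings)]

for name in variants:
    # BASEVOCAB models start from BERT-base; SCIVOCAB models do not.
    if "basevocab" in name:
        origin = "fine-tuned from BERT-base"
    else:
        origin = "trained from scratch"
    print(f"{name}: {origin}")
```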
Practical implementation
Here’s a demonstration of the NCBI disease corpus task, a Named Entity Recognition (NER) task in the biomedical field. The data used is part of a collection of 793 PubMed abstracts with annotated disease entities. Every token carries a tag: ‘B-’ (Beginning) indicates that the token starts an entity, ‘I-’ (Inside) indicates that the token is inside an entity, and ‘O’ indicates that the token is not part of a named entity.
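As an illustration of this tagging scheme, the sketch below groups tagged tokens back into entity mentions. The helper `extract_entities` and the example sentence are hypothetical and are not part of the NCBI data or of bert-sklearn.

```python
def extract_entities(tokens, tags):
    """Group tokens tagged B-X / I-X into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:
            # 'O' (or an I- tag with no open entity) ends the current entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# Hypothetical example sentence and tags.
tokens = ["Mutations", "cause", "ataxia", "telangiectasia", "and", "cancer", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
print(extract_entities(tokens, tags))
# [('ataxia telangiectasia', 'Disease'), ('cancer', 'Disease')]
```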
The code has been implemented in Google Colab with Python version 3.7.10. A step-by-step explanation of the code follows:
1) Get the NCBI data from AllenAI’s GitHub repository
%%bash
DATADIR="NCBI_disease"  # name given to the directory
if test ! -d "$DATADIR"; then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"  # create the directory if it doesn't exist
    cd "$DATADIR"     # change the current working directory
    # download the development set
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/dev.txt
    # download the test set
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/test.txt
    # download the training set
    wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/NCBI-disease/train.txt
fi  # end of the 'if' condition
2) Clone the GitHub repository of bert-sklearn, a scikit-learn wrapper for fine-tuning the BERT model
!git clone -b master https://github.com/charles9n/bert-sklearn
Change the directory and install bert-sklearn
!cd bert-sklearn; pip install .
3) Import required libraries
import os
import math
import random
import csv
import sys
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
import statistics as stats
from bert_sklearn import BertTokenClassifier
4) Define a function to read tsv file (‘tsv’ stands for ‘tab-separated values’)
def read_tsv(fname, quotechar=None):
    # open the utf-8 encoded file in read mode
    with open(fname, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))
The csv.reader() function returns a reader object that iterates over the lines of a delimited file and yields each line as a list of its column values.
You can then pick out just the column that holds the data you need.
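A small sketch of that behaviour on an in-memory sample follows. The two-column content is hypothetical, and quoting is switched off explicitly with `csv.QUOTE_NONE` here so that quote characters inside tokens would be kept literally.

```python
import csv
import io

# Hypothetical two-column, tab-separated sample (token, label).
sample = "ataxia\tB-Disease\ntelangiectasia\tI-Disease\nis\tO\n"

# Each row comes back as a list of column values.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
print(rows)

# Keep only the second column (the labels).
labels = [row[1] for row in rows]
print(labels)
```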
5) Define a function to flatten the array of tokens
def flatten(l):
    return [item for sublist in l for item in sublist]
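For reference, a short usage sketch of this helper on a hypothetical list of tokenized sentences, next to the equivalent standard-library call `itertools.chain.from_iterable`:

```python
from itertools import chain

def flatten(l):
    # Same helper as above, repeated so the example runs on its own.
    return [item for sublist in l for item in sublist]

# Hypothetical tokenized sentences.
sentences = [["ataxia", "telangiectasia"], ["is", "rare"]]

print(flatten(sentences))
print(list(chain.from_iterable(sentences)))  # same result via the standard library
```

Flattening is what allows sentence-level token lists to be counted or scored token by token in the later steps.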
6) Define a function to read the data file in CoNLL-2003 shared task format.
def read_CoNLL2003_format(fname, index=3):
    # Read the file
    with open(fname, encoding='utf-8') as f:
        lines = f.read().strip()
    # Find sentence-like boundaries by splitting on two consecutive newline characters
    lines = lines.split("\n\n")
    # Split each sentence block into its lines
    lines = [l.split("\n") for l in lines]
    # Get tokens (first column of every line)
    tokens = [[l.split()[0] for l in line] for line in lines]
    # Get output labels or tags (column selected by index)
    labels = [[l.split()[index] for l in ln] for ln in lines]
    # Convert the tokens and corresponding labels to a dataframe
    dt = {'tokens': tokens, 'labels': labels}
    df = pd.DataFrame(data=dt)
    return df
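The parsing logic can be tried on a tiny in-memory sample without downloading anything. The four-column lines below (token, two placeholder columns, NER tag) are a hypothetical stand-in for the layout the function expects, and `parse_conll` is an illustrative helper that returns plain lists instead of a dataframe.

```python
# Hypothetical sample: one token per line, sentences separated by a blank line.
sample = """\
Identification NN O O
of IN O O
APC2 NN O O

ataxia NN O B-Disease
telangiectasia NN O I-Disease
"""

def parse_conll(text, index=3):
    """Return (tokens, labels) as lists of lists, one inner list per sentence."""
    blocks = text.strip().split("\n\n")               # sentence boundaries
    sentences = [block.split("\n") for block in blocks]
    tokens = [[line.split()[0] for line in sent] for sent in sentences]
    labels = [[line.split()[index] for line in sent] for sent in sentences]
    return tokens, labels

tokens, labels = parse_conll(sample)
print(tokens)
print(labels)
```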
7) Define a function to read train, dev and test set data
DATADIR = "NCBI_disease/"

def get_data(trainfile=DATADIR + "train.txt",
             devfile=DATADIR + "dev.txt",
             testfile=DATADIR + "test.txt"):
    # Read the train, dev and test files provided as parameters
    train = read_CoNLL2003_format(trainfile, index=3)
    dev = read_CoNLL2003_format(devfile, index=3)
    # Combine the train and dev set
    train = pd.concat([train, dev])
    print("Train and dev data: %d sentences, %d tokens" %
          (len(train), len(flatten(train.tokens))))
    test = read_CoNLL2003_format(testfile, index=3)
    print("Test data: %d sentences, %d tokens" %
          (len(test), len(flatten(test.tokens))))
    return train, test
8) Perform train-test split
# Store the train and test set data returned by the get_data() function defined in step (7)
train, test = get_data()
# Separate out features (tokens) and labels of the training and test sets
X_train, y_train = train.tokens, train.labels
X_test, y_test = test.tokens, test.labels
print(len(train))  # print the number of instances in the training set
labels = np.unique(flatten(y_train))  # get unique labels
labels = list(labels)  # form a list of labels
print("\nNER tags:", labels)  # print the list containing unique labels
Output:
Train and dev data: 6347 sentences, 159670 tokens
Test data: 940 sentences, 24497 tokens
6347

NER tags: ['B-Disease', 'I-Disease', 'O']
9) See the initial records of training data
train.head()
10) Initialize the SciBERT model
Of the four versions of SciBERT, we use the BASEVOCAB cased version here.
%%time
# %%time records the execution time of the cell
model = BertTokenClassifier(bert_model='scibert-basevocab-cased',
                            max_seq_length=178,
                            epochs=3,
                            # gradient accumulation
                            gradient_accumulation_steps=4,
                            learning_rate=3e-5,
                            train_batch_size=16,  # batch size for training
                            eval_batch_size=16,   # batch size for evaluation
                            validation_fraction=0.,
                            # ignore the tokens with label 'O'
                            ignore_label=['O'])
The ‘max_seq_length’ parameter sets the maximum length of a token sequence that the model will handle. BERT’s limit is 512 tokens, but here we explicitly limit it to 178 (176 tokens + 2 for the [CLS] and [SEP] delimiters used by the BERT model).
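The arithmetic behind that limit can be sketched as follows. The helper `build_input` is hypothetical and only illustrates the token budget; how bert-sklearn truncates or pads sequences internally is not shown here.

```python
MAX_SEQ_LENGTH = 178
SPECIAL_TOKENS = 2  # [CLS] and [SEP]

def build_input(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Truncate to the token budget, then wrap with [CLS] and [SEP]."""
    budget = max_seq_length - SPECIAL_TOKENS  # 176 content tokens
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]

short_seq = ["tok"] * 10
long_seq = ["tok"] * 300

print(len(build_input(short_seq)))  # 12: all 10 tokens plus the 2 delimiters
print(len(build_input(long_seq)))   # 178: truncated to 176 tokens plus the 2 delimiters
```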
# Print the model's configuration
print(model)
Sample output:
11) Fine-tune the SciBERT model on the training data
model.fit(X_train, y_train)
Sample output:
12) Make predictions on test data
y_preds = model.predict(X_test)
13) Print a classification report on the model’s performance
print(classification_report(flatten(y_test), flatten(y_preds)))
Sample output:
Note: The outputs may vary slightly between runs and with the execution environment you choose.
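To show what the per-label figures in such a report mean, here is a minimal pure-Python sketch of token-level precision, recall and F1 for a single label. The helper `per_label_scores` and the toy label sequences are illustrative assumptions, not output of the model above, and the scores are computed per token rather than per complete entity.

```python
def per_label_scores(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
    fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy flattened label sequences, invented for illustration.
y_true = ["O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
y_pred = ["O", "B-Disease", "O", "O", "B-Disease", "B-Disease"]

for label in ["B-Disease", "I-Disease"]:
    p, r, f = per_label_scores(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case 'B-Disease' has two correct predictions and one false alarm, giving a precision of 2/3, a recall of 1 and an F1 of 0.8, while the single 'I-Disease' token is missed and scores 0 on all three.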
- Code source
- Google colab notebook of the above implementation can be found here.
References
Refer to the following sources for an in-depth understanding of the SciBERT model: