Guide To SciBERT: A Pre-trained BERT-Based Language Model For Scientific Text



SciBERT is a pre-trained, BERT-based language model for Natural Language Processing (NLP) tasks on scientific text. It was introduced in September 2019 by Iz Beltagy, Kyle Lo and Arman Cohan, researchers at the Allen Institute for Artificial Intelligence (AllenAI) (research paper).

SciBERT’s architecture is based on BERT (Bidirectional Encoder Representations from Transformers), so read the BERT research paper first if you are unfamiliar with the base model.

Overview of SciBERT 

Training deep neural networks for NLP tasks requires a large amount of labelled data. Gathering that much data is feasible in many general domains, but it is difficult in scientific domains because annotating scientific text requires expertise. SciBERT, a model pre-trained on a large corpus of scientific text, leverages unsupervised pre-training and significantly improves on BERT’s performance in scientific NLP tasks.

Background of SciBERT

The base model BERT is trained on two tasks:

  • Predict randomly masked tokens
  • Predict whether two sentences follow each other
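The two objectives above can be illustrated with a minimal sketch of how training examples might be prepared. This is a simplification under stated assumptions: the helper names `mask_tokens` and `make_sentence_pair` are hypothetical, the example sentence is made up, and BERT's actual masking procedure is more involved than replacing every selected token with `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; targets keep the original token at masked positions."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # the model must predict this token
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)       # nothing to predict here
    return masked, targets

def make_sentence_pair(sent_a, sent_b, is_next):
    """Pack two sentences into one input with a follows / does-not-follow label."""
    return {
        "tokens": ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"],
        "is_next": is_next,
    }

# Made-up example sentence
tokens = "the enzyme catalyses the hydrolysis of starch".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Because no labels are needed to build such examples, the same recipe applies to any large body of raw text, including scientific papers.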

SciBERT follows the same model architecture as BERT; the only difference is that it is trained on scientific text instead.

Vocabulary used by SciBERT

In the context of SciBERT, the original WordPiece vocabulary used by the BERT model is called BASEVOCAB. SciBERT also has its own WordPiece vocabulary, called SCIVOCAB, which is built from the scientific corpus using the SentencePiece library.
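To see why a domain-specific vocabulary can matter, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The two tiny vocabularies are made up for illustration and are not the real BASEVOCAB or SCIVOCAB; `wordpiece_tokenize` is a hypothetical helper, not a library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies (assumptions for illustration only)
toy_general_vocab = {"immun", "##og", "##lo", "##bul", "##in"}
toy_science_vocab = {"immunoglobulin"}

print(wordpiece_tokenize("immunoglobulin", toy_general_vocab))
# ['immun', '##og', '##lo', '##bul', '##in']
print(wordpiece_tokenize("immunoglobulin", toy_science_vocab))
# ['immunoglobulin']
```

A vocabulary built from scientific text tends to keep frequent scientific terms whole or in fewer pieces, which is the motivation for having SCIVOCAB alongside BASEVOCAB.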

Core NLP tasks SciBERT can accomplish

Variants of SciBERT

There are four versions of the SciBERT model based on:

(i) cased or uncased

(ii) BASEVOCAB or SCIVOCAB

The two models using BASEVOCAB are fine-tuned from the corresponding BERT-base models. The other two models, which use SCIVOCAB, are trained from scratch.

Practical implementation

Here’s a demonstration of the NCBI disease corpus task, a Named Entity Recognition (NER) task in the biomedical field. The data is part of a collection of 793 PubMed abstracts with annotated disease entities. Every token carries a tag: a ‘B-’ (Beginning) tag marks the first token of an entity, an ‘I-’ (Inside) tag marks a token inside an entity, and the ‘O’ tag marks a token that is not part of any named entity.
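The tagging scheme can be made concrete with a small sketch that turns a tagged token sequence back into entity strings. The sentence and its tags are made up for illustration, and `extract_entities` is a hypothetical helper rather than part of any library used in this guide.

```python
def extract_entities(tokens, tags):
    """Collect the text of each entity from B-/I-/O tags."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous entity
                entities.append(" ".join(current))
            current = [token]                # start a new entity
        elif tag.startswith("I-") and current:
            current.append(token)            # continue the open entity
        else:
            if current:                      # an 'O' tag ends the open entity
                entities.append(" ".join(current))
            current = []
    if current:                              # entity that runs to the end
        entities.append(" ".join(current))
    return entities

tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer",
          "and", "ovarian", "cancer", "."]
tags = ["O", "O", "O", "O", "B-Disease", "I-Disease",
        "O", "B-Disease", "I-Disease", "O"]

print(extract_entities(tokens, tags))
# ['breast cancer', 'ovarian cancer']
```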

The code has been implemented in Google Colab with Python version 3.7.10. A step-wise explanation of the code is as follows:

1) Get the NCBI data from AllenAI’s GitHub repository

 DATADIR="NCBI_disease" #Name given to the directory
 if test ! -d "$DATADIR"; then
     echo "Creating $DATADIR dir" #print statement
     mkdir "$DATADIR" #Create the directory if it doesn't exist
     cd "$DATADIR" #change the current working directory
     #download development set data
     #download training data
     #download test set data
 fi #end of 'if' condition

2) Clone the GitHub repository of bert-sklearn, a scikit-learn wrapper for fine-tuning the BERT model

!git clone -b master

Change the directory and install bert-sklearn

!cd bert-sklearn; pip install .

3) Import required libraries

 import os
 import math
 import random
 import csv
 import sys
 import numpy as np
 import pandas as pd
 from sklearn import metrics
 from sklearn.metrics import f1_score, precision_score, recall_score
 from sklearn.metrics import classification_report
 import statistics as stats
 from bert_sklearn import BertTokenClassifier 

4) Define a function to read a TSV file (‘TSV’ stands for ‘tab-separated values’)

 def read_tsv(fname, quotechar=None):
     #open the utf-8 encoded file in read mode
     with open(fname, "r", encoding='utf-8') as f:
         return list(csv.reader(f, delimiter="\t", quotechar=quotechar))

The csv.reader() function returns a reader object that iterates over the lines of a delimited file and yields each line as a list of its column values.

You can then pick out only the column you need, for example the token column or the label column.
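As a small illustration of reading tab-separated rows and selecting columns, the sketch below parses an in-memory string instead of a file. The sample rows are made up, and the four-column layout is an assumption chosen to match the `index=3` label column used later in this guide. Here `quoting=csv.QUOTE_NONE` is passed so that quotation marks inside tokens are read literally.

```python
import csv
import io

# Made-up sample: token, two filler columns, NER tag
sample = (
    "Mutations\tNNS\tO\tO\n"
    "cause\tVBP\tO\tO\n"
    "breast\tNN\tO\tB-Disease\n"
    "cancer\tNN\tO\tI-Disease\n"
)

# Each row comes back as a list of column values
rows = list(csv.reader(io.StringIO(sample), delimiter="\t",
                       quoting=csv.QUOTE_NONE))

tokens = [row[0] for row in rows]  # first column: the token
labels = [row[3] for row in rows]  # fourth column: the tag

print(rows[2])   # ['breast', 'NN', 'O', 'B-Disease']
print(tokens)    # ['Mutations', 'cause', 'breast', 'cancer']
print(labels)    # ['O', 'O', 'B-Disease', 'I-Disease']
```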

5) Define a function to flatten the array of tokens

    def flatten(l):
       return [item for sublist in l for item in sublist] 
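A quick usage sketch shows what flattening does to a list of tokenized sentences; the sentences are made up. The standard library's `itertools.chain.from_iterable` gives the same result and is shown only for comparison.

```python
from itertools import chain

def flatten(l):
    # Same one-level flatten as in step (5)
    return [item for sublist in l for item in sublist]

sentences = [["breast", "cancer"], ["ovarian", "cancer", "risk"]]
flat = flatten(sentences)

print(flat)       # ['breast', 'cancer', 'ovarian', 'cancer', 'risk']
print(len(flat))  # 5, the total token count across both sentences
print(flat == list(chain.from_iterable(sentences)))  # True
```

Flattening is what makes it possible to count tokens across all sentences and to compare true and predicted tags token by token.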

6) Define a function to read the data file in CoNLL-2003 shared task format.

 def read_CoNLL2003_format(fname, index=3):
     # Read the file
     with open(fname, encoding="utf-8") as f:
         lines = f.read().strip()
     # Find sentence-like boundaries by splitting on two consecutive newlines
     lines = lines.split("\n\n")
     # Split each sentence on newlines
     lines = [l.split("\n") for l in lines]
     # Get tokens
     tokens = [[l.split()[0] for l in line] for line in lines]
     # Get output labels or tags
     labels = [[l.split()[index] for l in line] for line in lines]
     # Convert the tokens and corresponding labels to a DataFrame
     dt = {'tokens': tokens, 'labels': labels}
     df = pd.DataFrame(data=dt)
     return df
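The file layout that this reader expects can be shown with a short, dependency-free sketch: one token per line, whitespace-separated columns, and a blank line between sentences. The sample text is made up, and `parse_conll` is a hypothetical stand-in that returns plain lists instead of a DataFrame.

```python
def parse_conll(text, index=3):
    """Split CoNLL-style text into per-sentence token and label lists."""
    # A blank line separates sentences; each remaining line is one token
    sentences = [block.split("\n") for block in text.strip().split("\n\n")]
    tokens = [[line.split()[0] for line in sent] for sent in sentences]
    labels = [[line.split()[index] for line in sent] for sent in sentences]
    return tokens, labels

# Made-up sample with two sentences
sample = (
    "Mutations\tNNS\tO\tO\n"
    "cause\tVBP\tO\tO\n"
    "cancer\tNN\tO\tB-Disease\n"
    "\n"
    "No\tDT\tO\tO\n"
    "tumour\tNN\tO\tB-Disease\n"
)

tokens, labels = parse_conll(sample)
print(tokens)  # [['Mutations', 'cause', 'cancer'], ['No', 'tumour']]
print(labels)  # [['O', 'O', 'B-Disease'], ['O', 'B-Disease']]
```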

7) Define a function to read train, dev and test set data 

 DATADIR = "NCBI_disease/"
 def get_data(trainfile=DATADIR + "train.txt",
              devfile=DATADIR + "dev.txt",
              testfile=DATADIR + "test.txt"):
     #Read the train, dev and test files provided as parameters
     train = read_CoNLL2003_format(trainfile, index=3)
     dev = read_CoNLL2003_format(devfile, index=3)
     #Combine the train and dev set
     train = pd.concat([train, dev])
     print("Train and dev data: %d sentences, %d tokens" %
           (len(train), len(flatten(train.tokens))))
     test = read_CoNLL2003_format(testfile, index=3)
     print("Test data: %d sentences, %d tokens" %
           (len(test), len(flatten(test.tokens))))
     return train, test

8) Load the train and test sets and separate features from labels

 #Store the train and test set data returned by the get_data() function defined in step (7)
 train, test = get_data()
 #Separate out features (tokens) and labels of training and test set
 X_train, y_train = train.tokens, train.labels
 X_test, y_test = test.tokens, test.labels
 print(len(train)) #Print the number of instances in training set
 label_list = np.unique(flatten(y_train)) #get unique labels
 label_list = list(label_list) #form a list of labels
 print("\nNER tags:", label_list) #print the list containing unique labels


 Train and dev data: 6347 sentences, 159670 tokens
 Test data: 940 sentences, 24497 tokens
 NER tags: ['B-Disease', 'I-Disease', 'O'] 

9) See the initial records of training data


[Image: first few records of the training data]

10) Initialize the SciBERT model

Out of the four versions of SciBERT, here we use the BASEVOCAB cased version.

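 %%time  #to record execution time
 model = BertTokenClassifier(bert_model='scibert-basevocab-cased',
                             max_seq_length=178,
                             #gradient accumulation is left at the library default here
                             train_batch_size=16, #batch size for training
                             eval_batch_size=16, #batch size for evaluation
                             ignore_label=['O']) #ignore the tokens with label 'O'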

The ‘max_seq_length’ parameter sets the maximum length of a token sequence that the model will handle. BERT’s limit is 512 tokens, but here we explicitly limit it to 178 (176 tokens + 2 for the [CLS] and [SEP] delimiters used by the BERT model).
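The arithmetic behind that limit can be sketched as follows. `build_input` is a hypothetical helper written for illustration; it is not part of bert-sklearn, and the library's own handling of long sequences may differ.

```python
def build_input(wordpieces, max_seq_length=178):
    """Truncate to leave room for the two delimiter tokens, then add them."""
    budget = max_seq_length - 2          # 176 slots remain for real tokens
    return ["[CLS]"] + wordpieces[:budget] + ["[SEP]"]

long_sentence = ["tok"] * 300            # longer than the limit
short_sentence = ["breast", "cancer"]    # well under the limit

print(len(build_input(long_sentence)))   # 178
print(build_input(short_sentence))       # ['[CLS]', 'breast', 'cancer', '[SEP]']
```

A smaller limit means shorter inputs and less memory per batch, at the cost of cutting off any sentence that exceeds it.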

 #Print the model’s configuration

Sample output:

[Image: printed model configuration]

11) Fine-tune the model on the training data

model.fit(X_train, y_train)

Sample output:

[Image: training output]

12) Make predictions on test data

y_preds = model.predict(X_test)

13) Print the classification report on the model’s performance

print(classification_report(flatten(y_test), flatten(y_preds)))

Sample output:

[Image: classification report]

Note: The outputs may vary a bit for each execution of the code and also depending on the execution environment you choose.
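For readers who want to check what the per-label figures in a token-level classification report mean, the sketch below recomputes precision, recall and F1 for one label from flattened tag lists. The true and predicted tags are made up, and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Token-level precision, recall and F1 for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up flattened tags for six tokens
y_true = ["B-Disease", "I-Disease", "O", "O", "B-Disease", "O"]
y_pred = ["B-Disease", "O", "O", "O", "B-Disease", "B-Disease"]

p, r, f1 = precision_recall_f1(y_true, y_pred, "B-Disease")
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 1.0 0.8
```

Here two of the three predicted ‘B-Disease’ tags are correct (precision 2/3) and both true ‘B-Disease’ tokens are found (recall 1.0).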

  • Code source
  • The Google Colab notebook of the above implementation can be found here.



Nikita Shiledarbaxi

A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.