Guide To SciBERT: A Pre-trained BERT-Based Language Model For Scientific Text


SciBERT is a pre-trained BERT-based language model for Natural Language Processing (NLP) tasks on scientific text. It was introduced by Iz Beltagy, Kyle Lo and Arman Cohan, researchers at the Allen Institute for Artificial Intelligence (AllenAI), in September 2019 (research paper).

Since the architecture of SciBERT is based on the BERT (Bidirectional Encoder Representations from Transformers) model, go through the BERT research paper if you are unaware of the state-of-the-art base model.


Overview of SciBERT 

Training deep neural networks for NLP tasks requires a huge amount of labelled data. While gathering that much data is feasible in many general domains, it is difficult in scientific domains because annotating scientific text requires expert knowledge. SciBERT is trained on a large corpus of scientific text; it leverages unsupervised pre-training and significantly improves on BERT’s performance in scientific NLP tasks.

Background of SciBERT

The base model BERT is trained on two tasks:

  • Predict randomly masked tokens
  • Predict whether two sentences follow each other

SciBERT follows the same model architecture as BERT; the only difference is that it is trained on scientific text instead.
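
The first of the two pre-training tasks listed above can be illustrated with a minimal sketch. It is a simplification and not the actual BERT or SciBERT pre-training code: the function name `mask_tokens`, the example sentence and the masking probability used in the call are assumptions made for illustration, and real BERT pre-training also sometimes keeps or randomly replaces the selected tokens instead of always inserting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked input and, per position, the token the model
    would have to predict (None where the position is not masked).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)   # position is not scored
    return masked, targets

# Made-up example sentence; a higher probability makes masking visible
tokens = "the enzyme catalyses the hydrolysis of peptide bonds".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Because the targets come from the text itself, no human annotation is needed, which is why this kind of pre-training scales to large unlabelled scientific corpora.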

Vocabulary used by SciBERT

When dealing with SciBERT, the original WordPiece vocabulary used by the BERT model is termed BASEVOCAB. SciBERT also has its own WordPiece vocabulary, called SCIVOCAB, which is built on the scientific corpus using the SentencePiece library.
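
To see why a domain-specific vocabulary matters, here is a minimal sketch of how a WordPiece vocabulary is applied to a word using greedy longest-match-first splitting. Both vocabularies below are tiny, made-up sets (not the real BASEVOCAB or SCIVOCAB), and the function name `wordpiece` is an assumption for illustration; the sketch shows how a vocabulary is used, not how SentencePiece builds one.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies, invented for this example
general_vocab = {"im", "##mun", "##og", "##lo", "##bu", "##lin"}
science_vocab = general_vocab | {"immunoglobulin"}

print(wordpiece("immunoglobulin", general_vocab))
# ['im', '##mun', '##og', '##lo', '##bu', '##lin']
print(wordpiece("immunoglobulin", science_vocab))
# ['immunoglobulin']
```

A vocabulary built on scientific text tends to keep frequent domain terms whole or in fewer pieces, so the model spends fewer positions of its input sequence on a single term.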

Core NLP tasks SciBERT can accomplish

Variants of SciBERT

There are four versions of the SciBERT model based on:

(i) cased or uncased

(ii) BASEVOCAB or SCIVOCAB

The two models using BASEVOCAB are fine-tuned from the corresponding BERT-base models. The other two models, which use SCIVOCAB, are trained from scratch.
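
The four variants are simply the combinations of the two vocabularies and the two casings. The short sketch below enumerates them; the identifier pattern `scibert-<vocab>-<case>` is an assumed naming scheme used here for illustration, so check the documentation of the library you use for the exact model names it accepts.

```python
from itertools import product

vocabularies = ["basevocab", "scivocab"]
casings = ["cased", "uncased"]

# Assumed identifier pattern, for illustration only
variants = [f"scibert-{vocab}-{case}"
            for vocab, case in product(vocabularies, casings)]

for name in variants:
    # BASEVOCAB models start from BERT-base; SCIVOCAB models do not
    if "basevocab" in name:
        origin = "fine-tuned from BERT-base"
    else:
        origin = "trained from scratch"
    print(f"{name}: {origin}")
```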

Practical implementation

Here’s a demonstration of the NCBI disease corpus task, a Named Entity Recognition (NER) task in the biomedical field. The data used is part of a collection of 793 PubMed abstracts with annotated disease entities. Every token carries a tag: ‘B-’ (Beginning) indicates that the token starts an entity, ‘I-’ (Inside) indicates that the token continues an entity, and ‘O’ indicates that the token is not part of a named entity.
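
The tagging scheme is easier to follow with a small example. The sketch below groups tagged tokens back into entity mentions; the function name `bio_to_spans` and the example sentence are made up for illustration and are not taken from the corpus.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from B-/I-/O tags."""
    spans, current, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(tok)  # continue the open entity
        else:
            # 'O' (or a stray 'I-') closes any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

# Invented sentence with the three tags used in this task
tokens = ["Patients", "with", "cystic", "fibrosis", "or", "asthma", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
print(bio_to_spans(tokens, tags))
# [('cystic fibrosis', 'Disease'), ('asthma', 'Disease')]
```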

The code has been implemented in Google colab with Python version 3.7.10. Step-wise explanation of the code is as follows:

1) Get the NCBI data from AllenAI’s GitHub repository
 DATADIR="NCBI_disease" #Name given to the directory
 if test ! -d "$DATADIR";then
     echo "Creating $DATADIR dir" #print statement
     mkdir "$DATADIR" #Create a directory if it doesn’t exist
     cd "$DATADIR" #change the current working directory
     #download development set data
     #download training data
     #download testset data
 fi #end of ‘if’ condition 

2) Clone the GitHub repository of bert-sklearn, a scikit-learn wrapper for fine-tuning the BERT model

!git clone -b master

Change the directory and install bert-sklearn

!cd bert-sklearn; pip install .

3) Import required libraries

 import os
 import math
 import random
 import csv
 import sys
 import numpy as np
 import pandas as pd
 from sklearn import metrics
 from sklearn.metrics import f1_score, precision_score, recall_score
 from sklearn.metrics import classification_report
 import statistics as stats
 from bert_sklearn import BertTokenClassifier 

4) Define a function to read tsv file (‘tsv’ stands for ‘tab-separated values’)

 def read_tsv(fname, quotechar=None):
 #open the utf-8 encoded file in read mode
     with open(fname, "r", encoding='utf-8') as f:
         return list(csv.reader(f, delimiter="\t", quotechar=quotechar))   

The csv.reader() function creates a reader object that iterates over the lines of a delimited file and returns each line as a list of its column values.

From each of these lists, you then select just the column that holds the data you need.
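
As a quick illustration of this behaviour, the sketch below reads tab-separated text from an in-memory string instead of a file. The sample rows are invented for the example, and quoting is switched off explicitly with `csv.QUOTE_NONE` so that quote characters in the text are treated as ordinary characters.

```python
import csv
import io

# Invented tab-separated sample: a header row and two data rows
sample = "token\tlabel\nBRCA1\tO\ncancer\tB-Disease\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)  # one list of column values per line
print(rows)
# [['token', 'label'], ['BRCA1', 'O'], ['cancer', 'B-Disease']]

# Select a single column (index 1), skipping the header row
labels = [row[1] for row in rows[1:]]
print(labels)
# ['O', 'B-Disease']
```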

5) Define a function to flatten the array of tokens

    def flatten(l):
       return [item for sublist in l for item in sublist] 
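
A short usage sketch of this helper: the data is stored as one list of labels per sentence, while token-level metrics expect a single flat sequence. The sample labels below are made up, and the comparison with `itertools.chain.from_iterable` is only there to show an equivalent standard-library approach.

```python
from itertools import chain

def flatten(l):
    # One list of labels per sentence -> one flat list of labels
    return [item for sublist in l for item in sublist]

# Invented labels for two sentences
sentence_labels = [["O", "B-Disease", "I-Disease"], ["O", "O"]]
flat = flatten(sentence_labels)
print(flat)
# ['O', 'B-Disease', 'I-Disease', 'O', 'O']

# Equivalent result with the standard library
print(flat == list(chain.from_iterable(sentence_labels)))
# True
```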

6) Define a function to read the data file in CoNLL-2003 shared task format.

 def read_CoNLL2003_format(fname, index=3):
     # Read the file
     with open(fname, encoding='utf-8') as f:
         lines = f.read().strip()
     # Find sentence boundaries by splitting on blank lines (two newlines)
     lines = lines.split("\n\n")
     # Split each sentence into its lines
     lines = [l.split("\n") for l in lines]
     # Get tokens (first column)
     tokens = [[l.split()[0] for l in line] for line in lines]
     # Get output labels or tags (column given by index)
     labels = [[l.split()[index] for l in ln] for ln in lines]
     # Convert the tokens and corresponding labels to a DataFrame
     data = {'tokens': tokens, 'labels': labels}
     df = pd.DataFrame(data=data)
     return df
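
To make the parsing steps concrete, here is a minimal sketch of the same splitting logic applied to an in-memory string, returning plain lists instead of a DataFrame. The function name `parse_conll` and the sample rows are assumptions made for illustration; the sample only mimics the four-column layout in which the tag sits at index 3.

```python
# Invented sample: one token per line, sentences separated by a blank line
sample = (
    "Identification NN O O\n"
    "of IN O O\n"
    "APC2 NN O O\n"
    "\n"
    "colon NN O B-Disease\n"
    "cancer NN O I-Disease\n"
)

def parse_conll(text, index=3):
    """Split CoNLL-style text into per-sentence token and label lists."""
    sentences = text.strip().split("\n\n")
    rows = [[line.split() for line in s.split("\n")] for s in sentences]
    tokens = [[cols[0] for cols in sent] for sent in rows]
    labels = [[cols[index] for cols in sent] for sent in rows]
    return tokens, labels

tokens, labels = parse_conll(sample)
print(tokens)
# [['Identification', 'of', 'APC2'], ['colon', 'cancer']]
print(labels)
# [['O', 'O', 'O'], ['B-Disease', 'I-Disease']]
```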

7) Define a function to read train, dev and test set data 

 DATADIR = "NCBI_disease/"
 def get_data(trainfile=DATADIR + "train.txt",
              devfile=DATADIR + "dev.txt",
              testfile=DATADIR + "test.txt"):
     #Read the train and dev files provided as parameters
     train = read_CoNLL2003_format(trainfile, index=3)
     dev = read_CoNLL2003_format(devfile, index=3)
     #Combine the train and dev set
     train = pd.concat([train, dev])
     print("Train and dev data: %d sentences, %d tokens"%
           (len(train), len(flatten(train.tokens))))
     #Read the test file
     test = read_CoNLL2003_format(testfile, index=3)
     print("Test data: %d sentences, %d tokens"%
           (len(test), len(flatten(test.tokens))))
     return train, test

8) Perform train-test split

 #Store the train and test set data returned by the get_data() function defined in step (7)
 train, test = get_data()
 #Separate out features (tokens) and labels of training and test set
 X_train, y_train = train.tokens, train.labels
 X_test, y_test = test.tokens, test.labels
 print(len(train)) #Print the number of instances in training set
 label_list = np.unique(flatten(y_train)) #get unique labels
 label_list = list(label_list) #form a list of labels
 print("\nNER tags:",label_list) #print the list containing unique labels


 Train and dev data: 6347 sentences, 159670 tokens
 Test data: 940 sentences, 24497 tokens
 NER tags: ['B-Disease', 'I-Disease', 'O'] 

9) See the initial records of training data

train.head()


SciBERT data

10) Initialize the SciBERT model

Out of the four versions of SciBERT, here we are using BASEVOCAB  CASED version.

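 %%time  #to record execution time
 model = BertTokenClassifier(bert_model='scibert-basevocab-cased',
                             max_seq_length=178, #maximum token sequence length
                             #gradient accumulation
                             train_batch_size=16,#batch size for training
                             eval_batch_size=16, #batch size for evaluation
                             #ignore the tokens with label ‘O’
                             ignore_label=['O'])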

The ‘max_seq_length’ parameter sets the maximum length of a token sequence that the model will handle. BERT’s limit is 512 tokens, but here we explicitly limit it to 178 (176 tokens + 2 for the [CLS] and [SEP] delimiters used by the BERT model).
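
The arithmetic behind this limit can be sketched as follows. This is an illustration of the idea only, not the library’s internal code; the function name `add_special_tokens` and the dummy tokens are assumptions made for the example.

```python
def add_special_tokens(tokens, max_seq_length=178):
    """Truncate to leave room for [CLS] and [SEP], then add them."""
    budget = max_seq_length - 2  # 178 - 2 = 176 ordinary tokens
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]

# A dummy sequence that is longer than the limit
long_sentence = ["tok%d" % i for i in range(200)]
encoded = add_special_tokens(long_sentence)
print(len(encoded))
# 178

# A short sequence is only wrapped, not truncated
print(add_special_tokens(["colon", "cancer"]))
# ['[CLS]', 'colon', 'cancer', '[SEP]']
```

A smaller limit than BERT’s 512 reduces memory use and training time, at the cost of cutting off any sequence longer than the limit.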

 #Print the model’s configuration
 print(model)

Sample output:

SciBERT model

11) Fine-tune the model on the training data

model.fit(X_train, y_train)

Sample output:

SciBERT output

12) Make predictions on test data

y_preds = model.predict(X_test)

13) Print the classification report on the model’s performance

print(classification_report(flatten(y_test), flatten(y_preds)))
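
To show what the per-label numbers in such a report mean, here is a minimal sketch that computes token-level precision, recall and F1 by hand. It is not scikit-learn’s implementation; the function name `per_label_scores` and the tiny label sequences are made up for illustration.

```python
from collections import Counter

def per_label_scores(y_true, y_pred):
    """Token-level precision, recall and F1 for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed the true label t
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Invented flat label sequences
y_true = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]
y_pred = ["O", "B-Disease", "O", "O", "O"]
for label, (p, r, f) in per_label_scores(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# B-Disease: precision=1.00 recall=0.50 f1=0.67
# I-Disease: precision=0.00 recall=0.00 f1=0.00
# O: precision=0.50 recall=1.00 f1=0.67
```

Note that these are token-level scores, matching the flattened inputs used above; entity-level NER scores, which require the whole mention to be correct, are generally stricter.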

Sample output:

SciBERT output

Note: The outputs may vary a bit for each execution of the code and also depending on the execution environment you choose.

  • Code source
  • Google colab notebook of the above implementation can be found here.


Refer to the following sources to gain an in-depth understanding of the SciBERT model:

Nikita Shiledarbaxi
A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.
