Why NLP-Powered scispaCy Is A Game-Changer For Biomedical Text Processing

A human genome contains the genetic information of an organism as DNA sequences organised into 23 pairs of chromosomes. A single DNA molecule consists of two strands connected by pairs of four different bases (A, T, C, G).

The human genome consists of around 3 billion of these base pairs. If each base pair is encoded in 2 bits, one copy of the genome takes about 750 MB, so a diploid cell, which carries two copies, holds about 1.5 GB of data. By an often-quoted estimate, humans contain around 100 trillion cells. The numbers are astounding.
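The arithmetic behind the 1.5 GB figure can be checked with a short sketch. It assumes the round numbers quoted in the text (3 billion base pairs, 2 bits per base pair, two genome copies per diploid cell); the function name is made up for illustration.

```python
# Back-of-the-envelope storage estimate for a human genome.
BASE_PAIRS_HAPLOID = 3_000_000_000  # ~3 billion base pairs (figure from the text)
BITS_PER_BASE_PAIR = 2              # 4 possible bases -> log2(4) = 2 bits
COPIES_IN_DIPLOID_CELL = 2          # a diploid cell carries two genome copies


def genome_bytes(base_pairs: int, copies: int = 1) -> int:
    """Bytes needed to store `copies` genome copies at 2 bits per base pair."""
    total_bits = base_pairs * BITS_PER_BASE_PAIR * copies
    return total_bits // 8


haploid_gb = genome_bytes(BASE_PAIRS_HAPLOID) / 1e9
diploid_gb = genome_bytes(BASE_PAIRS_HAPLOID, COPIES_IN_DIPLOID_CELL) / 1e9
print(f"haploid: {haploid_gb} GB, diploid: {diploid_gb} GB")
```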


A biomedical researcher is therefore tasked with handling data that is not only inherently large but also comes with a multitude of combinations and classifications.

On top of this, academia frequently reports newly discovered drugs and proteins.


All this information is stored as vast amounts of text. Skimming through it for discoveries and deductions would take a lifetime. Computers have made it easy to find information such as a specific genome name, but only in a naive way, as the user has to possess the information prior to the search.

Researchers at the Allen Institute for Artificial Intelligence came up with a new library named scispaCy, developed specifically for biomedical and scientific text processing.

Most of the tools available today deal with entity linking, abbreviation detection and negation detection. For traditional NLP tasks, there is GENIA. But these tools do not implement state-of-the-art word representations or neural networks.

Making Room For Biomedical Applications With scispaCy

In a paper titled scispaCy: Fast and Robust Models for Biomedical Natural Language Processing, the researchers introduce a specialised NLP library for processing biomedical texts, built on the spaCy library.

To emphasise the efficiency and practical utility of the end-to-end pipeline provided by the scispaCy packages, the researchers compare its speed against several other publicly available processing pipelines for biomedical text, using 10k randomly selected PubMed abstracts.
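A speed comparison of this kind boils down to timing each pipeline over the same set of documents. The following is only a minimal sketch of such a harness, not the procedure used in the paper: `seconds_per_document` and `whitespace_pipeline` are hypothetical names, and the whitespace tokeniser merely stands in for a real pipeline.

```python
import time
from typing import Callable, Iterable


def seconds_per_document(pipeline: Callable[[str], object], docs: Iterable[str]) -> float:
    """Average wall-clock seconds the pipeline spends per document."""
    docs = list(docs)
    if not docs:
        raise ValueError("need at least one document to time")
    start = time.perf_counter()
    for doc in docs:
        pipeline(doc)
    elapsed = time.perf_counter() - start
    return elapsed / len(docs)


def whitespace_pipeline(text: str) -> list:
    """Hypothetical stand-in for a real NLP pipeline: splits on whitespace."""
    return text.split()


# Assumption: a repeated sentence stands in for the PubMed abstracts.
abstracts = ["Myeloid derived suppressor cells are immature myeloid cells."] * 1000
avg = seconds_per_document(whitespace_pipeline, abstracts)
print(f"{avg * 1e6:.2f} microseconds per document")
```

Because a loaded spaCy pipeline object is itself callable on a string, it could be passed in place of `whitespace_pipeline` to time a real model on the same documents.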

For training, the researchers used the GENIA 1.0 corpus. This dataset has part-of-speech tags annotated, which were used to train the part-of-speech tagger jointly with the dependency parser.

The researchers have also included the PubMed metadata for the abstracts, which was discarded in the GENIA corpus.

The original metadata includes relevant named entities for chemicals and drugs linked to a variety of ontologies, along with citation statistics and journal metadata.

For named entity recognition (NER) models, the training was done on the following datasets:

  • BC5CDR – for chemicals and diseases
  • CRAFT – for cell types, chemicals, proteins, genes
  • JNLPBA – for cell lines, cell types, DNAs, RNAs and proteins
  • BioNLP13CG – for cancer genetics

Along with the datasets mentioned above, the researchers have also covered five more datasets, such as Linnaeus and AnatEM, for a variety of entity types, including cancer genetics, pathway analysis and trial population extraction.
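The four NER training datasets listed above can be kept in a small lookup table when deciding which one suits a given extraction task. This is an illustrative sketch only: the entity types are copied from the list in this article rather than from the datasets' official label sets, and `datasets_covering` is a made-up helper.

```python
from typing import Dict, List

# Entity types per dataset, as described in the article (not official label names).
NER_DATASETS: Dict[str, List[str]] = {
    "BC5CDR": ["chemical", "disease"],
    "CRAFT": ["cell type", "chemical", "protein", "gene"],
    "JNLPBA": ["cell line", "cell type", "DNA", "RNA", "protein"],
    "BioNLP13CG": ["cancer genetics"],
}


def datasets_covering(entity_type: str) -> List[str]:
    """Return the datasets whose listed entity types include `entity_type`."""
    wanted = entity_type.lower()
    return [
        name
        for name, types in NER_DATASETS.items()
        if wanted in (t.lower() for t in types)
    ]


print(datasets_covering("protein"))   # ['CRAFT', 'JNLPBA']
print(datasets_covering("chemical"))  # ['BC5CDR', 'CRAFT']
```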

Another key challenge with biomedical data is its commonly occurring abbreviated names and noun compounds containing punctuation, which can cause sentence boundaries to be misidentified.
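To make the abbreviation problem concrete, here is a simplified sketch in the spirit of the Schwartz and Hearst heuristic for pairing a parenthesised short form with the words before it. It is not presented as scispaCy's own implementation; the regular expression, the candidate window size and the function names are assumptions chosen for illustration.

```python
import re
from typing import Dict, Optional


def find_long_form(short: str, candidate: str) -> Optional[str]:
    """Match the short form's characters right to left against the candidate text."""
    s = len(short) - 1
    c = len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (c >= 0 and candidate[c].lower() != ch) or (
            s == 0 and c > 0 and candidate[c - 1].isalnum()
        ):
            c -= 1
        if c < 0:
            return None
        c -= 1
        s -= 1
    start = candidate.rfind(" ", 0, c + 1) + 1
    return candidate[start:]


def find_abbreviations(text: str) -> Dict[str, str]:
    """Map each parenthesised short form to its best-matching long form."""
    found: Dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = match.group(1)
        words = text[: match.start()].split()
        # Assumed search window: a few more words than the short form has characters.
        window = min(len(short) + 5, len(short) * 2)
        long_form = find_long_form(short, " ".join(words[-window:]))
        if long_form:
            found[short] = long_form
    return found


text = (
    "Myeloid derived suppressor cells (MDSC) are immature myeloid cells with "
    "immunosuppressive activity. They accumulate in tumor-bearing mice and humans "
    "with different types of cancer, including hepatocellular carcinoma (HCC)."
)
print(find_abbreviations(text))
```

On this text the sketch pairs MDSC with "Myeloid derived suppressor cells" and HCC with "hepatocellular carcinoma". The second pair shows why matching word initials alone is not enough: the letters of HCC are spread over only two words.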

So, to evaluate sentence segmentation, both sentence-level and full-abstract accuracy were used.
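The two metrics can be sketched as follows. The definitions here are assumptions and may differ in detail from the paper's: sentence accuracy is taken as the share of gold sentences reproduced exactly, and full-abstract accuracy as the share of abstracts whose whole segmentation matches the gold one.

```python
from typing import List, Tuple

Segmentation = List[str]  # the sentences of one abstract, in order


def segmentation_accuracy(
    gold: List[Segmentation], predicted: List[Segmentation]
) -> Tuple[float, float]:
    """Return (sentence accuracy, full-abstract accuracy) under the assumed definitions."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same abstracts")
    total_sentences = sum(len(g) for g in gold)
    if total_sentences == 0:
        raise ValueError("need at least one gold sentence")
    correct_sentences = 0
    correct_abstracts = 0
    for g, p in zip(gold, predicted):
        predicted_set = set(p)
        correct_sentences += sum(1 for sent in g if sent in predicted_set)
        if g == p:
            correct_abstracts += 1
    return correct_sentences / total_sentences, correct_abstracts / len(gold)


# Toy data: the second abstract is wrongly split at a full stop inside parentheses.
gold = [["A b.", "C d."], ["E f (G. h) i."]]
predicted = [["A b.", "C d."], ["E f (G.", "h) i."]]
sentence_acc, abstract_acc = segmentation_accuracy(gold, predicted)
print(f"sentence accuracy: {sentence_acc:.3f}, full-abstract accuracy: {abstract_acc:.3f}")
```

Full-abstract accuracy is the stricter of the two, since a single wrong split anywhere in an abstract counts the whole abstract as incorrect.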



pip install scispacy

Python code for carrying out entity recognition using scispaCy:

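import scispacy
import spacy

# The en_core_sci_sm model package must be installed separately from scispacy.
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)

# Sentences found by the pipeline.
print(list(doc.sents))
# Output as listed in the source:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

# Entity mentions found by the pipeline.
print(doc.ents)
# Output as listed in the source (truncated there; exact spans depend on the model version):
# (Myeloid derived suppressor cells,
#  myeloid cells,
#  immunosuppressive activity,
#  tumor-bearing mice,
#  hepatocellular carcinoma,
#  ...)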

Key Takeaways

  • scispaCy sets a benchmark for named entity recognition models, both for more specific entity extraction applications and for comparison with other tools.
  • scispaCy demonstrates competitive performance with two fast and convenient released pipelines for biomedical text, which include tokenisation, part-of-speech tagging, dependency parsing and named entity recognition.



Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
