Why NLP-Powered scispaCy Is A Game-Changer For Biomedical Text Processing

A human genome contains the genetic information of an organism as DNA sequences, organised into 23 pairs of chromosomes. A single DNA molecule consists of two strands, which are connected by pairs of four different bases (A, T, C, G).

The human genome consists of around 3 billion of these base pairs. If each base pair is encoded as a 2-bit value, one copy of the genome comes to roughly 750 MB, so a diploid cell, which carries two copies, would contain about 1.5 GB of data. And the human body contains tens of trillions of cells. The numbers are astounding.
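The arithmetic behind that estimate can be checked with a short sketch. It assumes the figures quoted in the text (about 3 billion base pairs, 2 bits per base) and decimal gigabytes; the function name `genome_bytes` is only illustrative.

```python
# Back-of-the-envelope storage estimate for the genome figures quoted in the text.
BASE_PAIRS_HAPLOID = 3_000_000_000  # ~3 billion base pairs (figure from the article)
BITS_PER_BASE = 2                   # four bases (A, T, C, G) fit in 2 bits


def genome_bytes(base_pairs: int, copies: int = 1) -> float:
    """Return the raw storage, in bytes, needed for `copies` copies of the genome."""
    return base_pairs * BITS_PER_BASE * copies / 8


haploid_gb = genome_bytes(BASE_PAIRS_HAPLOID) / 1e9
diploid_gb = genome_bytes(BASE_PAIRS_HAPLOID, copies=2) / 1e9

print(f"haploid: {haploid_gb:.2f} GB, diploid: {diploid_gb:.2f} GB")
# prints: haploid: 0.75 GB, diploid: 1.50 GB
```

This counts only the raw sequence; real genomic file formats add headers, quality scores and other annotations on top.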

This leaves biomedical researchers handling data that is not only inherently large but also comes with a multitude of combinations and classifications.

Added to this are the frequent discoveries of drugs and proteins by academia.

All this information is stored in the form of tonnes of text. Skimming through this text for discoveries and deductions would take a lifetime. Computers have made it easier to find information such as a specific genome name, but only in a naive way: the user has to know what to look for before starting the search.

The researchers at the Allen Institute for Artificial Intelligence have come up with a new library named scispaCy, developed specifically for biomedical and scientific text processing.

Most of the tools available today deal with entity linking, abbreviation detection and negation detection. For traditional NLP tasks, there are tools such as the GENIA tagger. But these tools do not use state-of-the-art word representations or neural networks.

Making Room For Biomedical Applications With scispaCy

In a paper titled scispaCy: Fast and Robust Models for Biomedical Natural Language Processing, the researchers introduce a specialised NLP library for processing biomedical texts, built on the spaCy library.

To emphasise the efficiency and practical utility of the end-to-end pipeline provided by the scispaCy packages, the researchers compare its speed against several other publicly available processing pipelines for biomedical text, using 10k randomly selected PubMed abstracts.
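A speed comparison of this kind boils down to running each pipeline over the same set of abstracts and timing it. The sketch below is not the researchers' benchmark code; `time_pipeline` and `whitespace_pipeline` are hypothetical names, and the whitespace tokeniser is only a stand-in so the example runs without any model installed. With scispaCy, the loaded `nlp` object would be passed in its place.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_pipeline(process: Callable[[str], object], texts: Iterable[str]) -> Tuple[int, float]:
    """Run `process` on every text; return (documents processed, elapsed seconds)."""
    start = time.perf_counter()
    n_docs = 0
    for text in texts:
        process(text)
        n_docs += 1
    return n_docs, time.perf_counter() - start


def whitespace_pipeline(text: str) -> List[str]:
    """Stand-in pipeline (hypothetical): plain whitespace tokenisation."""
    return text.split()


# 10k copies of one sentence stand in for 10k PubMed abstracts.
abstracts = ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells."] * 10_000
n_docs, seconds = time_pipeline(whitespace_pipeline, abstracts)
print(f"{n_docs} documents in {seconds:.3f} s")
```

Timing every candidate through the same function keeps the comparison fair, since each one sees identical input and the same measurement overhead.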

For training, the researchers used the GENIA 1.0 corpus. This dataset is annotated with part-of-speech tags, which were used to train the part-of-speech tagger jointly with the dependency parser.

The researchers have also included the PubMed metadata for the abstracts, which had been discarded in the GENIA corpus.

The original metadata includes relevant named entities for chemicals and drugs, linked to a variety of ontologies, along with citation statistics and journal metadata.

For named entity recognition (NER) models, the training was done on the following datasets:

  • BC5CDR – for chemicals and diseases
  • CRAFT – for cell types, chemicals, proteins, genes
  • JNLPBA – for cell lines, cell types, DNAs, RNAs and proteins
  • BioNLP13CG – for cancer genetics

Along with the datasets mentioned above, the researchers have also covered five more datasets, including Linnaeus and AnatEM, for a variety of entity types relevant to tasks such as cancer genetics, pathway analysis and trial population extraction.
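In practice, the choice of NER model depends on which entity types are needed. The lookup below is a minimal sketch: the entity coverage is taken from the list above, while the package names are the ones the scispaCy project publishes for these four datasets and should be checked against the release actually installed. The helper `models_for` is hypothetical.

```python
from typing import Dict, List, Set

# Entity coverage as described in the article. Package names are assumed to match
# the NER models published by the scispaCy project; verify them before installing.
NER_MODELS: Dict[str, Set[str]] = {
    "en_ner_bc5cdr_md": {"chemicals", "diseases"},
    "en_ner_craft_md": {"cell types", "chemicals", "proteins", "genes"},
    "en_ner_jnlpba_md": {"cell lines", "cell types", "DNAs", "RNAs", "proteins"},
    "en_ner_bionlp13cg_md": {"cancer genetics"},
}


def models_for(entity_type: str) -> List[str]:
    """Return the names of the models whose training data covers `entity_type`."""
    return sorted(name for name, types in NER_MODELS.items() if entity_type in types)


print(models_for("chemicals"))  # ['en_ner_bc5cdr_md', 'en_ner_craft_md']
print(models_for("proteins"))   # ['en_ner_craft_md', 'en_ner_jnlpba_md']
```

Once a model is chosen and its package is installed, it can be loaded with `spacy.load(name)` in the same way as the general-purpose pipeline.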

Another key challenge with biomedical text is its commonly occurring abbreviated names and noun compounds containing punctuation, which can cause sentence boundaries to be misidentified.

So, for evaluating sentence segmentation, both sentence-level and full-abstract accuracy were used.
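To see why both measures matter, here is a small sketch that scores a deliberately naive splitter. The definitions are assumptions made for illustration, not taken from the paper: sentence accuracy is the share of gold sentences reproduced exactly, and full-abstract accuracy is the share of abstracts whose whole segmentation matches. The example abstracts and the helpers `naive_split` and `segmentation_scores` are made up.

```python
import re
from typing import List, Tuple


def naive_split(text: str) -> List[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segmentation_scores(gold: List[List[str]], predicted: List[List[str]]) -> Tuple[float, float]:
    """Return (sentence accuracy, full-abstract accuracy) under the assumed definitions."""
    total_sentences = sum(len(g) for g in gold)
    # A gold sentence counts as a hit if the prediction contains it verbatim.
    hits = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    # An abstract counts only if every sentence boundary is right.
    exact_abstracts = sum(g == p for g, p in zip(gold, predicted))
    return hits / total_sentences, exact_abstracts / len(gold)


gold = [
    ["E. coli was cultured overnight.", "Growth was measured at 37 C."],
    ["MDSC accumulate in tumor-bearing mice.", "They suppress T cells."],
]
abstracts = [" ".join(sentences) for sentences in gold]
predicted = [naive_split(abstract) for abstract in abstracts]

sentence_acc, abstract_acc = segmentation_scores(gold, predicted)
print(f"sentence accuracy: {sentence_acc:.2f}, full-abstract accuracy: {abstract_acc:.2f}")
# prints: sentence accuracy: 0.75, full-abstract accuracy: 0.50
```

The abbreviation "E." trips the naive splitter once, which costs a single sentence but invalidates the whole first abstract. That is why the stricter full-abstract measure is a useful companion to the sentence-level one.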

Read more about scispaCy here

Installation

pip install scispacy

Python code for carrying out entity recognition using scispacy (the en_core_sci_sm model package has to be installed separately from the library):

import scispacy
import spacy

# Load the small scispaCy pipeline (its model package must already be installed).
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)

# Sentence segmentation: one span per detected sentence.
print(list(doc.sents))
# Output:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

# Entity mentions found by the pipeline.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)

Key Takeaways

  • scispaCy sets a benchmark for named entity recognition models aimed at more specific entity extraction applications, including in comparison with other tools.
  • scispaCy demonstrates competitive performance by releasing and evaluating two fast and convenient pipelines for biomedical text, which include tokenisation, part-of-speech tagging, dependency parsing and named entity recognition.

 


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
