Why NLP-Powered scispaCy Is A Game-Changer For Biomedical Text Processing

Published on March 7, 2019
by Ram Sagar

The human genome holds an organism's genetic information as DNA sequences organised into 23 pairs of chromosomes. A single DNA molecule consists of two strands connected by pairs of four different bases (A, T, C, G).

The human genome consists of around 3 billion of these base pairs. If each base pair is encoded as a 2-bit combination, a diploid cell, which carries two copies of the genome, would contain about 1.5 GB of data. And humans contain around 100 trillion cells. The numbers are astounding.
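The arithmetic behind that 1.5 GB figure can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes exactly 3 billion base pairs, 2 bits per base pair, decimal gigabytes, and it takes the 100 trillion cell count at face value (not every cell carries a full diploid genome). The names used here are illustrative.

```python
# Back-of-the-envelope estimate of raw genome data size.
BASE_PAIRS_PER_COPY = 3_000_000_000   # ~3 billion base pairs in one genome copy
BITS_PER_BASE_PAIR = 2                # 4 possible bases -> 2 bits each
COPIES_IN_DIPLOID_CELL = 2            # one copy inherited from each parent
CELLS_IN_BODY = 100_000_000_000_000   # ~100 trillion, the figure quoted above


def genome_bytes(base_pairs, bits_per_base_pair=BITS_PER_BASE_PAIR):
    """Bytes needed to store `base_pairs` at the given bit width."""
    return base_pairs * bits_per_base_pair // 8


haploid_bytes = genome_bytes(BASE_PAIRS_PER_COPY)
diploid_bytes = haploid_bytes * COPIES_IN_DIPLOID_CELL
# Rough upper bound: assumes every cell holds a full diploid genome.
body_bytes = diploid_bytes * CELLS_IN_BODY

print(f"one genome copy: {haploid_bytes / 1e9:.2f} GB")
print(f"diploid cell:    {diploid_bytes / 1e9:.2f} GB")
print(f"whole body:      {body_bytes / 1e21:.0f} ZB (zettabytes)")
```

One genome copy comes to 0.75 GB, so the 1.5 GB figure only holds once both copies in a diploid cell are counted.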

This leaves biomedical researchers handling data that is not only inherently large but also comes with a multitude of combinations and classifications.

Added to this are the frequent discoveries of drugs and proteins by academia.

All this information is stored as vast amounts of text. Skimming through this text for discoveries and deductions would take a lifetime. Computers have made it easy to find information such as a specific genome name, but only in a naive way, as the user has to know what to look for before searching.

Researchers at the Allen Institute for Artificial Intelligence have introduced a new library named scispaCy, developed specifically for biomedical and scientific text processing.

Most of the tools available today deal with specialised tasks such as entity linking, abbreviation detection and negation detection. For traditional NLP tasks, there is GENIA. But these tools do not implement state-of-the-art word representations and neural networks.

Making Room For Biomedical Applications With scispaCy

In a paper titled scispaCy: Fast and Robust Models for Biomedical Natural Language Processing, the researchers introduce a specialised NLP library for processing biomedical texts, built on the spaCy library.

To emphasise the efficiency and practical utility of the end-to-end pipeline provided by the scispaCy packages, the researchers compare its speed against several other publicly available processing pipelines for biomedical text, using 10k randomly selected PubMed abstracts.

For training, the researchers used the GENIA 1.0 corpus. This dataset is annotated with part-of-speech tags, which were used to train the part-of-speech tagger jointly with the dependency parser.

The researchers have also included the PubMed metadata for the abstracts, which had been discarded in the GENIA corpus.

The original metadata includes relevant named entities for chemicals and drugs linked to a variety of ontologies, along with citation statistics and journal metadata.
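A speed comparison of this kind can be sketched with a small timing harness. The version below is not the researchers' benchmark code: the `benchmark` function and the two toy "pipelines" are hypothetical stand-ins, and the abstracts are synthetic strings rather than real PubMed text. It only illustrates the shape of the experiment: draw a fixed random sample once, run every pipeline over the same sample, and record wall-clock time.

```python
import random
import time


def benchmark(pipelines, abstracts, sample_size=10_000, seed=0):
    """Time each pipeline on the same random sample of abstracts.

    `pipelines` maps a name to any callable that accepts one text string.
    Returns a dict of name -> elapsed seconds.
    """
    rng = random.Random(seed)  # fixed seed so every run sees the same sample
    sample = rng.sample(abstracts, min(sample_size, len(abstracts)))
    timings = {}
    for name, process in pipelines.items():
        start = time.perf_counter()
        for text in sample:
            process(text)
        timings[name] = time.perf_counter() - start
    return timings


# Toy stand-ins for real processing pipelines (hypothetical).
def whitespace_pipeline(text):
    return text.split()


def character_pipeline(text):
    return list(text)


# Synthetic placeholder abstracts; a real run would load PubMed abstracts.
abstracts = [f"Abstract {i}: BRCA1 mutations and breast cancer risk."
             for i in range(200)]

timings = benchmark(
    {"whitespace": whitespace_pipeline, "characters": character_pipeline},
    abstracts,
    sample_size=100,
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 100 abstracts")
```

To compare real tools, each entry in `pipelines` would be swapped for a function that runs that tool end to end on one abstract.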

For named entity recognition (NER) models, the training was done on the following datasets:

BC5CDR – for chemicals and diseases
CRAFT – for cell types, chemicals, proteins and genes
JNLPBA – for cell lines, cell types, DNA, RNA and proteins
BioNLP13CG – for cancer genetics
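In practice, each of these corpora corresponds to a separately installable NER model. The sketch below assumes the package names published by the scispaCy project at the time of writing (for example `en_ner_bc5cdr_md`); the article itself does not name them, so check them against the current release. The `extract_entities` helper is hypothetical and needs spaCy plus the chosen model package installed.

```python
# Lookup from training corpus to scispaCy NER model package.
# Package names are an assumption based on the project's published models.
NER_MODELS = {
    "BC5CDR": "en_ner_bc5cdr_md",          # chemicals and diseases
    "CRAFT": "en_ner_craft_md",            # cell types, chemicals, proteins, genes
    "JNLPBA": "en_ner_jnlpba_md",          # cell lines, cell types, DNA, RNA, proteins
    "BioNLP13CG": "en_ner_bionlp13cg_md",  # cancer genetics
}


def extract_entities(text, corpus):
    """Return (entity text, label) pairs using the model trained on `corpus`.

    Hypothetical helper: requires `spacy` and the matching model package
    to be installed beforehand.
    """
    import spacy  # imported lazily so the lookup table works without spaCy

    nlp = spacy.load(NER_MODELS[corpus])
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


for corpus, package in NER_MODELS.items():
    print(f"{corpus:<11} -> {package}")
```

With the BC5CDR model installed, a call such as `extract_entities("Aspirin can cause gastric bleeding.", "BC5CDR")` would return whatever chemical and disease mentions that model finds.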

Along with the datasets mentioned above, the researchers have also covered five more datasets, including Linnaeus and AnatEM, for a variety of entity types and tasks such as cancer genetics, pathway analysis and trial population extraction.

Another key challenge with biomedical text is its frequent use of abbreviated names and noun compounds containing punctuation, which can cause sentence boundaries to be misidentified.

So, for evaluating sentence segmentation, both sentence and full-abstract accuracy were used.
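The difference between the two measures is easy to see in a small sketch. The definitions here are assumptions, since the article does not spell them out: sentence accuracy is taken as the share of gold sentences whose exact character span is recovered, and full-abstract accuracy as the share of abstracts segmented with no error at all. The spans are made-up examples.

```python
def segmentation_accuracy(gold, predicted):
    """Compute (sentence accuracy, full-abstract accuracy).

    `gold` and `predicted` are parallel lists of abstracts; each abstract
    is a list of (start, end) character spans, one per sentence.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")

    total_sentences = 0
    correct_sentences = 0
    correct_abstracts = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        pred_set = set(pred_spans)
        total_sentences += len(gold_spans)
        correct_sentences += sum(1 for span in gold_spans if span in pred_set)
        if gold_spans == pred_spans:  # every boundary right, nothing extra
            correct_abstracts += 1
    return correct_sentences / total_sentences, correct_abstracts / len(gold)


gold = [
    [(0, 40), (41, 90)],
    [(0, 25), (26, 60), (61, 100)],
]
predicted = [
    [(0, 40), (41, 90)],                       # segmented perfectly
    [(0, 25), (26, 45), (46, 60), (61, 100)],  # wrongly split, e.g. at an abbreviation
]

sentence_acc, abstract_acc = segmentation_accuracy(gold, predicted)
print(f"sentence accuracy: {sentence_acc:.2f}")  # 0.80
print(f"abstract accuracy: {abstract_acc:.2f}")  # 0.50
```

A single bad split costs one sentence out of five but sinks the whole abstract, which is why the stricter abstract-level number is a useful companion to the sentence-level one.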

Key Takeaways

scispaCy sets a benchmark for named entity recognition models aimed at more specific entity extraction applications, and holds up when compared with other tools.
scispaCy releases and evaluates two fast and convenient pipelines for biomedical text, covering tokenisation, part-of-speech tagging, dependency parsing and named entity recognition, and demonstrates competitive performance.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
