Hands-On Guide To Natural Language Processing Using spaCy

spaCy is an open-source Python library for advanced natural language processing and machine learning.

It can be used to build information extraction and natural language understanding systems, and to pre-process text for deep learning. It supports deep learning workflows that use convolutional neural network models for part-of-speech tagging, dependency parsing, and named entity recognition.

spaCy is mainly developed by Matthew Honnibal and maintained by Ines Montani. It is written in Python and Cython (a superset of Python that compiles to C). It supports text processing in more than 60 languages, such as English, Hindi, Spanish, German, French, and Dutch. Its main focus is industrial use. In contrast, NLTK is mainly used for research and for learning natural language processing.
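As a rough illustration of the multi-language support, the sketch below builds blank (tokenizer-only) pipelines for three of the languages mentioned above with spacy.blank and tokenizes one short sentence in each. The sample sentences are made up for this example, and no trained model is required.

```python
import spacy

# Hypothetical sample sentences, keyed by ISO language code
samples = {
    "en": "This is a sentence.",
    "de": "Das ist ein Satz.",
    "es": "Esto es una frase.",
}

tokens_by_lang = {}
for lang_code, sentence in samples.items():
    # spacy.blank creates a pipeline that contains only the
    # language-specific tokenizer rules
    nlp = spacy.blank(lang_code)
    tokens_by_lang[lang_code] = [token.text for token in nlp(sentence)]
    print(lang_code, tokens_by_lang[lang_code])
```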

Introduction to Natural Language Processing

Natural language processing is a set of techniques, applied here with Python and open-source libraries, used to extract information from unstructured text, identify “named entities”, analyze word structure in text (including parsing and semantic analysis), access popular text databases such as WordNet and treebanks, and integrate techniques drawn from deep learning and artificial intelligence. Its applications range from predictive text and email filtering to automatic summarization and translation.

spaCy v1:

It is the first version of spaCy, released in February 2015. It includes the basic features of natural language processing, such as tokenization and lemmatization, among others.

spaCy v2:

The latest stable spaCy v2 release at the time of writing came out on 11 December 2020. It is built for production use in the software industry. It offers extensive named entity recognition support, integrates with deep learning frameworks for developing deep learning models, and includes the other features listed below.

Features:

  • Non-destructive tokenization
  • Named entity recognition
  • Support for 61+ languages
  • 46 statistical models for 16 languages
  • Pre-Trained word vectors
  • State-of-the-art speed
  • Easy deep learning integration
  • Part-of-speech tagging
  • Labeled dependency parsing
  • Syntax-driven sentence segmentation
  • Built-in visualizers for syntax and NER
  • Convenient string-to-hash mapping
  • Export to NumPy data arrays
  • Efficient binary serialization
  • Easy model packaging and deployment
  • Robust, rigorously evaluated accuracy
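A few of the features above can be tried without downloading a trained model. The following is a minimal sketch, assuming a blank English pipeline and a made-up example sentence; it touches on non-destructive tokenization, string-to-hash mapping, export to NumPy arrays, and binary serialization.

```python
import spacy
from spacy.attrs import LOWER, ORTH
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no trained components
text = "Apple is looking at buying a U.K. startup."
doc = nlp(text)

# Non-destructive tokenization: each token keeps its trailing whitespace,
# so the original string can be rebuilt exactly
rebuilt = "".join(token.text_with_ws for token in doc)

# String-to-hash mapping: look up the hash for a string, then go back
apple_hash = nlp.vocab.strings["Apple"]
apple_text = nlp.vocab.strings[apple_hash]

# Export to NumPy: one row per token, one column per requested attribute
array = doc.to_array([ORTH, LOWER])

# Binary serialization: round-trip the Doc through bytes
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)

print(rebuilt == text, apple_text, array.shape, restored.text == text)
```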

Installation:

spaCy can be installed with pip or conda. For pip, first make sure that pip, setuptools, and wheel are up to date.

Using pip:

pip install -U pip setuptools wheel

pip install spacy

Using conda:

conda install -c conda-forge spacy

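After either install route, a quick sanity check might look like the sketch below. It only prints the installed version and creates a blank English pipeline, so it does not depend on any downloaded model.

```python
import spacy

# Confirm that the package imports and report its version
print("spaCy version:", spacy.__version__)

# A blank pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
print("Pipeline components:", nlp.pipe_names)
```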

spaCy Models:

To use spaCy you need to download a trained model. The small English model can be installed with spaCy's download command, which uses pip under the hood:

$ python -m spacy download en_core_web_sm

 import spacy
 nlp = spacy.load("en_core_web_sm")
 text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
 # Process the text
 doc = nlp(text)
 # Print the document text
 print(doc.text) 

Output: It’s official: Apple is the first U.S. public company to reach a $1 trillion market value

Source code: https://spacy.io/models
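If the model may not be installed on every machine, one option is to check for it before loading. The helper below is a hypothetical sketch (load_pipeline is not part of spaCy): it uses spacy.util.is_package to test whether the model package is present and otherwise falls back to a blank English pipeline, which can tokenize but cannot tag, parse, or recognize entities.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load a trained model if installed, else a blank pipeline."""
    if is_package(name):
        return spacy.load(name)
    # Fallback: tokenizer-only pipeline, no tagger, parser, or NER
    return spacy.blank("en")


nlp = load_pipeline()
print("Components:", nlp.pipe_names)
```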

spaCy Pipeline:

1. Tokenization:

Word tokens are the basic units of text involved in any NLP labeling task. The first step when processing text is to split it into tokens.

Import the spaCy language class and create an nlp object from it, as shown in the following code. Then process your text by calling the nlp object on a string or on the contents of a text file, which returns a doc. Select the token you want by index and print its text attribute to get its value as a string.

 # Import the English language class and create the nlp object
 from spacy.lang.en import English

 nlp = English()

 # Process the text
 doc = nlp("I like tree kangaroos and narwhals.")

 # Select the first token
 first_token = doc[0]

 # Print the first token's text
 print(first_token.text)

Output: I

Source code: https://course.spacy.io/en/chapter1
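Beyond indexing a single token, a doc can also be sliced into spans, and each token carries lexical attributes that need no trained model. A small sketch using the same sentence:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Slicing a Doc returns a Span; the end index is exclusive
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# Lexical attributes come from the tokenizer and vocabulary alone
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
```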

2. Parts of speech tagging:

When we learn basic grammar, we understand the difference between nouns, verbs, adjectives, and adverbs, and although it may seem pointless at the time, it turns out to be a critical element of Natural Language Processing.

Spacy provides convenient tools for breaking down sentences into lists of words and then categorizing each word with a specific part of speech based on the context.

The code below retrieves the part-of-speech tags:

Import spaCy, load the model, and process the text with the nlp object. Then iterate over the tokens in the doc to print each token's text, part-of-speech tag, and dependency label, as shown in the code.

 import spacy
 nlp = spacy.load("en_core_web_sm")
 text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
 # Process the text
 doc = nlp(text)
 for token in doc:
     # Get the token text, part-of-speech tag and dependency label
     token_text = token.text
     token_pos = token.pos_
     token_dep = token.dep_
     # This is for formatting only
     print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

Output:

 It          PRON      nsubj     
 ’s          VERB      punct     
 official    NOUN      ROOT      
 :           PUNCT     punct     
 Apple       PROPN     nsubj     
 is          AUX       ROOT      
 the         DET       det       
 first       ADJ       amod      
 U.S.        PROPN     nmod      
 public      ADJ       amod      
 company     NOUN      attr      
 to          PART      aux       
 reach       VERB      relcl     
 a           DET       det       
 $           SYM       quantmod  
 1           NUM       compound  
 trillion    NUM       nummod    
 market      NOUN      compound  
 value       NOUN      dobj   
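The tag and dependency abbreviations in this output can be looked up with spacy.explain, which returns a short description from spaCy's built-in glossary. The labels chosen below are simply a sample taken from the table above.

```python
import spacy

# A handful of labels from the output table
labels = ["PROPN", "AUX", "nsubj", "amod", "dobj"]

# spacy.explain maps a label to a human-readable description
explanations = {label: spacy.explain(label) for label in labels}
for label, description in explanations.items():
    print(f"{label:<8}{description}")
```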

3. Named Entity Detection:

One of the most common labeling problems is finding entities in text. Typical named entities include the names of people such as politicians and actors, famous locations, organizations, and the products those organizations sell.

Import spaCy, load the model, and process the text with nlp. Then iterate over the entities in doc.ents and print each entity's text and label. In this example the model does not recognize "iPhone X" as an entity, so the code also selects that span by hand.

 import spacy
 nlp = spacy.load("en_core_web_sm")
 text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
 # Process the text
 doc = nlp(text)
 # Iterate over the entities
 for ent in doc.ents:
     # Print the entity text and label
     print(ent.text, ent.label_)
 # Get the span for "iPhone X"
 iphone_x = doc[1:3]
 # Print the span text
 print("Missing entity:", iphone_x.text)

Output:

 Apple ORG
 Missing entity: iPhone X
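One way to register a missed entity such as "iPhone X" is to create a labeled Span and add it to doc.ents. The sketch below is an illustration only: it uses a blank pipeline so that doc.ents starts out empty, and the PRODUCT label is a choice made for this example. With a trained model, assigning a span that overlaps an existing entity would raise an error.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so no entities are predicted
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Tokens 1 and 2 are "iPhone" and "X"; the label is assigned by hand
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# doc.ents must be assigned as a whole sequence of non-overlapping spans
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```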

4. Dependency parsing:

The main concept of dependency parsing is that the linguistic units (words) of a sentence are connected to each other by directed links. These links are called dependencies in linguistics.

Import spaCy and displaCy to visualize the dependencies between the words.

Code:

 import spacy
 from spacy import displacy
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("This is a sentence.")
 displacy.serve(doc, style="dep") 

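Output: displacy.serve starts a local web server, and the dependency tree for the sentence can be viewed in a browser.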
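To make the idea of directed links concrete without a trained model or a web server, displaCy can also render a parse that is written out by hand. In the sketch below the words, tags, and arcs are illustrative values typed in for this example, not model output; displacy.render with manual=True returns the SVG markup as a string.

```python
from spacy import displacy

# Hand-written parse of "This is a sentence" (illustrative, not model output)
parse = {
    "words": [
        {"text": "This", "tag": "PRON"},
        {"text": "is", "tag": "AUX"},
        {"text": "a", "tag": "DET"},
        {"text": "sentence", "tag": "NOUN"},
    ],
    # Each arc links two word indices; "dir" says which end the arrow points to
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# manual=True tells displaCy the input is already in its own dict format
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)
print(len(svg), "characters of SVG markup")
```

The resulting string can be written to a .svg file and opened in any browser.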

5. Matcher:

The Matcher is very powerful and lets you bootstrap many NLP tasks, such as entity extraction, by finding patterns in a text or document.

As in the code above, import spaCy and the Matcher, initialize the matcher with the pipeline's shared vocabulary, and define the pattern you want to search for in the doc. Then add the pattern to the matcher, call the matcher on the doc, and print the matches.

The code below makes this concrete.

Code:

 import spacy
 # Import the Matcher
 from spacy.matcher import Matcher
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
 # Initialize the Matcher with the shared vocabulary
 matcher = Matcher(nlp.vocab)
 # Create a pattern matching two tokens: "iPhone" and "X"
 pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
 # Add the pattern to the matcher as a list of patterns
 # (spaCy v3 no longer accepts the older add(name, None, pattern) call)
 matcher.add("IPHONE_X_PATTERN", [pattern])
 # Use the matcher on the doc
 matches = matcher(doc)
 print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Output:

 Matches: ['iPhone X']
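Patterns are not limited to exact text. The sketch below, which assumes spaCy v3 (or a late v2 release that accepts the list-of-patterns form of Matcher.add), matches on lexical attributes instead. The sentence and the IPHONE_MODEL key are made up for this example, and a blank pipeline is enough because only tokenizer-level attributes are used.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("I sold an iPhone X and bought an iphone 11 yesterday.")

matcher = Matcher(nlp.vocab)
# Pattern 1: "iphone" in any casing followed by a token made of digits
numbered = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
# Pattern 2: "iphone" in any casing followed by "x" in any casing
lettered = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Several patterns can share one match key
matcher.add("IPHONE_MODEL", [numbered, lettered])

matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```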

GitHub: https://github.com/explosion/spaCy

Summary:

We learned about the spaCy Python library for NLP problems: what NLP is, how spaCy is used to solve NLP tasks, and how it is used in industry. We also covered some important parts of the spaCy pipeline and the code for using them when developing advanced NLP models.

Amit Singh

Amit Singh is a Data Scientist who graduated in Computer Science and Engineering. He is a Data Science writer at Analytics India Magazine.
