spaCy is an open-source Python library for advanced natural language processing and machine learning. It is used to build information extraction and natural language understanding systems, and to pre-process text for deep learning. It supports deep learning workflows with convolutional neural network models for part-of-speech tagging, dependency parsing, and named entity recognition.
spaCy is mainly developed by Matthew Honnibal and maintained by Ines Montani. It is written in Python and Cython (C extensions for Python). More than 60 languages are available for text processing, including English, Hindi, Spanish, German, French, and Dutch. spaCy focuses mainly on industrial use, whereas NLTK is used mainly for research and for learning natural language processing.
Introduction to Natural Language Processing
Natural language processing uses Python and open-source libraries to extract information from unstructured text, identify “named entities”, analyze word structure (including parsing and semantic analysis), access popular text databases such as WordNet and treebanks, and integrate techniques drawn from deep learning and artificial intelligence. Applications range from predictive text and email filtering to automatic summarization and translation.
spaCy v1:
The first version of spaCy was released in February 2015. It includes the core features of natural language processing, such as stemming, tokenization, and lemmatization, among others.
spaCy v2:
spaCy v2 is the stable version, built for industrial use. It supports improved entity recognition and deep learning integration for developing deep learning models, along with many other features listed below.
Features:
- Non-destructive tokenization
- Named entity recognition
- Support for 61+ languages
- 46 statistical models for 16 languages
- Pre-Trained word vectors
- State-of-the-art speed
- Easy deep learning integration
- Part-of-speech tagging
- Labeled dependency parsing
- Syntax-driven sentence segmentation
- Built-in visualizers for syntax and NER
- Convenient string-to-hash mapping
- Export to NumPy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- Robust, rigorously evaluated accuracy
Installation:
spaCy can be installed with pip (after updating pip, setuptools, and wheel) or with conda.
Using pip:
pip install -U pip setuptools wheel
pip install spacy
Using conda:
conda install -c conda-forge spacy
spaCy Models:
To use spaCy's statistical models, you first need to download one with the following command:
$ python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)
Output: It’s official: Apple is the first U.S. public company to reach a $1 trillion market value
Source code: https://spacy.io/models
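One feature from the list above, sentence segmentation, also has a simple rule-based variant that works without any downloaded model. A small sketch, assuming spaCy v3's add_pipe API:

```python
from spacy.lang.en import English

nlp = English()
# Add the rule-based sentence segmenter (spaCy v3 API)
nlp.add_pipe("sentencizer")

doc = nlp("spaCy is fast. It is also easy to use.")
for sent in doc.sents:
    print(sent.text)
```

The trained models replace this rule-based splitting with syntax-driven segmentation based on the dependency parse.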
spaCy Pipeline:
1. Tokenization:
Word tokens are the basic units of text in any NLP labeling task. The first step when processing text is to split it into tokens.
Import the spaCy language class and create an NLP object from it, as shown in the following code. Then process your text data (or text file) with the NLP object. Select the token you want, and print its value in text form using the token's text attribute.
# Import the English language class and create the NLP object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)
Output: I
Source code: https://course.spacy.io/en/chapter1
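Beyond the raw text, each token exposes lexical attributes such as is_alpha, is_punct, and like_num. A short sketch in the same style as the course example (blank pipeline, no model needed):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty.")

# Collect the tokens that look like numbers
numbers = [token.text for token in doc if token.like_num]
print(numbers)  # ['1990', '60']
```

Note that the tokenizer splits "60%" into the tokens "60" and "%", which is why like_num picks up the "60".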
2. Part-of-speech tagging:
When we learn basic grammar, we understand the difference between nouns, verbs, adjectives, and adverbs, and although it may seem pointless at the time, it turns out to be a critical element of Natural Language Processing.
Spacy provides convenient tools for breaking down sentences into lists of words and then categorizing each word with a specific part of speech based on the context.
Here is the code to get the part-of-speech tags:
Import spacy and load the model, process the text with the nlp object, then iterate over the tokens to get text -> POS -> dependency label, as shown in the code.
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")
Output:
It          PRON      nsubj
’s          VERB      punct
official    NOUN      ROOT
:           PUNCT     punct
Apple       PROPN     nsubj
is          AUX       ROOT
the         DET       det
first       ADJ       amod
U.S.        PROPN     nmod
public      ADJ       amod
company     NOUN      attr
to          PART      aux
reach       VERB      relcl
a           DET       det
$           SYM       quantmod
1           NUM       compound
trillion    NUM       nummod
market      NOUN      compound
value       NOUN      dobj
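If a tag or label abbreviation in the output is unfamiliar, spacy.explain returns a short human-readable description (no model required):

```python
import spacy

# Look up definitions of common part-of-speech tags and dependency labels
print(spacy.explain("PROPN"))  # proper noun
print(spacy.explain("nsubj"))  # nominal subject
```

This works for part-of-speech tags, dependency labels, and entity labels alike.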
3. Named Entity Recognition:
One of the most common labeling problems is finding entities in text. Typically, named entities include the names of politicians, actors, famous locations, and organizations, as well as the products those organizations bring to market.
Just import spacy and load the model, process the text with the nlp object, then iterate over every entity and print its text and label.
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Output:
Apple ORG
Missing entity: iPhone X
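A missing entity like this can also be added by hand: a labeled Span can be assigned to doc.ents. A minimal sketch with a blank pipeline (PRODUCT is the label the trained English models use for products):

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked")

# The blank pipeline has no NER component, so doc.ents starts out empty
# Create a span covering tokens 1-2 ("iPhone X") and label it manually
iphone_x = Span(doc, 1, 3, label="PRODUCT")
doc.ents = [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])  # [('iPhone X', 'PRODUCT')]
```

This is useful for correcting model output or for building training data.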
4. Dependency parsing:
The main concept of dependency parsing is that each linguistic unit (word) is connected to another by a directed link. These links are called dependencies in linguistics.
Import spacy and displacy to visualize the dependencies between the words.
Code:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
displacy.serve(doc, style="dep")
Output: displaCy starts a local web server and renders the dependency parse as an arc diagram in the browser.
5. Matcher:
The Matcher is very powerful and allows you to bootstrap many NLP-based tasks, such as entity extraction or finding pattern matches in a text or document.
As in the code above, import spacy and Matcher, initialize the matcher with the shared vocabulary, and define the pattern you want to search for in the doc. Then add the pattern to the matcher and print the matches found in the doc.
Look at the below code for clarity.
Code:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
# (spaCy v3 signature; in v2 use matcher.add("IPHONE_X_PATTERN", None, pattern))
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Output: Matches: ['iPhone X']
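Patterns are not limited to exact text: they can match on lexical attributes such as LOWER and IS_DIGIT, which need only the tokenizer. A sketch matching "iOS" followed by any version number, assuming spaCy v3's Matcher.add signature:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # tokenizer only; lexical attributes need no trained model
matcher = Matcher(nlp.vocab)

# Match "iOS" (case-insensitive) followed by any number token
pattern = [{"LOWER": "ios"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])

doc = nlp("iOS 7 was released years before iOS 11.")
matches = matcher(doc)
results = [doc[start:end].text for _, start, end in matches]
print(results)  # ['iOS 7', 'iOS 11']
```

Attribute-based patterns generalize to strings you have never seen, unlike exact-text matching.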
Github:https://github.com/explosion/spaCy
Summary:
We learned about the spaCy Python library for NLP problems: what NLP is, how spaCy is used to solve NLP tasks in industry, and some important spaCy pipeline components along with their code for developing advanced NLP models.