Stanza – A New NLP Library By Stanford

Most NLP libraries are built primarily around English text. A few support other languages, but they rarely match the results they achieve on English, because languages differ widely and techniques that work for English may not transfer well. To address this, Stanford developed Stanza, a Python-based NLP library for many human languages.

Stanza is a Python-based NLP library whose tools can be chained in a neural pipeline that converts a string of human-language text into lists of sentences and words, and then produces the base forms of those words, their parts of speech, and their morphological features. The toolkit is designed to work across more than 70 languages, using the Universal Dependencies formalism.

Stanza is built from highly accurate neural network components that can also be trained and evaluated efficiently on your own annotated data. The modules are built on top of the PyTorch library, and Stanza can use a GPU to speed up the analysis of text in any supported language, including English.

Stanza also includes a Python interface to the CoreNLP Java package and inherits additional functionality from it, including constituency parsing, coreference resolution, and linguistic pattern matching.

The package can be installed with the pip package manager:

pip install stanza
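
Once the package is installed, a typical session downloads the models for one language and runs the default pipeline over raw text. The following is a minimal sketch, assuming the English models and network access for the one-time download; `annotate` and `word_table` are illustrative helper names and are not part of Stanza.

```python
def annotate(text, lang="en"):
    """Run Stanza's default pipeline for `lang` over raw text."""
    import stanza  # imported lazily so word_table() can be used on its own

    stanza.download(lang)        # one-time download of the pretrained models
    nlp = stanza.Pipeline(lang)  # builds the pipeline with the default processors
    return nlp(text)


def word_table(doc):
    """Return (text, lemma, upos, feats) for every word in an annotated document."""
    return [
        (word.text, word.lemma, word.upos, word.feats)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling `word_table(annotate("Stanford released Stanza."))` would then yield one tuple per word, with the lemma, universal POS tag, and morphological features the pipeline predicted.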

Key Stanza features

  • Native Python implementation requires minimal effort to set up
  • Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition
  • Pretrained neural models supporting 66 (human) languages
  • A stable, officially maintained Python interface to CoreNLP

System Design And Architecture

The pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences. It is designed with the diversity of human languages in mind: the models are data-driven and learn the differences between languages. The components of Stanza are also highly modular and reuse basic model architectures where possible, for compactness.

Tokenization and Sentence Split: When given raw text, Stanza first tokenizes it and groups the tokens into sentences. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. The same module also predicts which tokens are multi-word tokens, because that decision is context-sensitive in some languages.
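
One way to picture how a single module can do both jobs is to treat segmentation as labelling every character: inside a token, end of a token, or end of a sentence. The sketch below is a toy illustration of that idea with hand-written labels standing in for model predictions; the label names and the `decode` function are hypothetical and are not Stanza's implementation.

```python
# Hypothetical per-character labels a tagger might predict.
INSIDE, END_TOKEN, END_SENTENCE = 0, 1, 2


def decode(text, labels):
    """Turn one label per character into a list of sentences, each a list of tokens."""
    sentences, tokens, current = [], [], ""
    for ch, label in zip(text, labels):
        if not ch.isspace():
            current += ch
        # Both "end" labels close the token being built.
        if label in (END_TOKEN, END_SENTENCE) and current:
            tokens.append(current)
            current = ""
        # Only the sentence label also closes the sentence.
        if label == END_SENTENCE and tokens:
            sentences.append(tokens)
            tokens = []
    # Flush whatever is left if the text did not end on an "end" label.
    if current:
        tokens.append(current)
    if tokens:
        sentences.append(tokens)
    return sentences
```

Because token and sentence boundaries come out of the same label sequence, one pass over the characters yields both segmentations at once.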

Multi-Word Token Expansion: The tokenizer identifies multi-word tokens, which are then expanded into the underlying syntactic words that serve as the basis for downstream processing. This is done with an ensemble of a frequency lexicon and a sequence-to-sequence (seq2seq) model, so that expansions observed frequently in the training set are always expanded reliably, while the model retains the flexibility to handle unseen words statistically.
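
The idea of always trusting frequently observed expansions and falling back to a statistical model for unseen tokens can be sketched in a few lines. This is a toy illustration under stated assumptions: `build_lexicon` and `expand` are hypothetical names, and `fallback` is any callable standing in for a trained seq2seq model.

```python
from collections import Counter, defaultdict


def build_lexicon(training_pairs):
    """Map each multi-word token to its most frequent expansion in the training pairs."""
    counts = defaultdict(Counter)
    for token, expansion in training_pairs:
        counts[token.lower()][tuple(expansion)] += 1
    return {tok: list(c.most_common(1)[0][0]) for tok, c in counts.items()}


def expand(token, lexicon, fallback):
    """Use the lexicon for tokens seen in training, otherwise defer to `fallback`."""
    known = lexicon.get(token.lower())
    return known if known is not None else fallback(token)
```

For example, with French training pairs such as `("du", ["de", "le"])`, a later occurrence of "du" is expanded from the lexicon, while a token never seen in training is passed to the fallback model.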

POS and Morphological Feature Tagging: For each word in a sentence, Stanza assigns a part-of-speech (POS) tag and predicts its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, among others). To predict POS and UFeats, the researchers adopted a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture.
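
In the Universal Dependencies format, morphological features are written as `Name=Value` pairs joined by `|`, for example `Number=Sing|Person=3`. The helper below is a small sketch for turning such a string into a dictionary, assuming the features arrive in that format (or as `None` or `_` when a word has none); `parse_feats` is an illustrative name, not a Stanza function.

```python
def parse_feats(feats):
    """Parse a UD feature string such as 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

This makes it easy to filter words by a single feature, for instance keeping only those whose parsed features contain `"Number": "Plur"`.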

Lemmatization: Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Like the multi-word token expander, Stanza’s lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. In addition, a classifier built on the encoder output of the seq2seq model predicts shortcuts such as lowercasing and identity copy, for robustness on long input sequences such as URLs.
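
The interplay of dictionary lookup, shortcut decisions, and neural generation can be sketched as a simple cascade. This is a toy illustration only: the order of the three steps is an assumption, `toy_shortcut` is a hand-written rule standing in for the learned classifier, and `seq2seq` is any callable standing in for the neural lemmatizer.

```python
def toy_shortcut(word):
    """Hand-written stand-in for a learned shortcut classifier."""
    if word.startswith(("http://", "https://")):
        return "identity"   # copy long inputs such as URLs unchanged
    if word.istitle():
        return "lowercase"  # e.g. a sentence-initial "The"
    return "generate"       # no shortcut applies


def lemmatize(word, dictionary, seq2seq, shortcut=toy_shortcut):
    """Try the dictionary, then a shortcut, then fall back to the seq2seq stand-in."""
    if word in dictionary:
        return dictionary[word]
    action = shortcut(word)
    if action == "identity":
        return word
    if action == "lowercase":
        return word.lower()
    return seq2seq(word)
```

The shortcuts matter because generating a long string such as a URL character by character is error-prone, whereas copying it is trivially correct.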

Performance

The researchers evaluated Stanza on a total of 112 datasets and found that its neural pipeline adapts well to text of different genres, achieving state-of-the-art performance at each step of the pipeline. In addition, Stanza's Python interface to the widely used Java CoreNLP software gives access to richer functionality such as relation extraction and coreference resolution.

Stanza is open-source and has pre-trained models for all supported languages and datasets available for public download. Researchers hope Stanza can enable multilingual NLP research and applications, and drive new research that can produce insights from a wide range of human languages.

Outlook

Stanza supports a wide range of languages and, through its Python interface to CoreNLP, extends its functionality beyond the neural pipeline. However, the researchers still have a few things to improve before it becomes a go-to NLP library for processing different languages effectively.

First, each downloadable Stanza model is trained on only a single dataset. To make the models more robust, they would need to be trained on data pooled from different sources. Second, the library is optimized for accuracy, which at times comes at the cost of computational efficiency and limits the toolkit’s use. Finally, the researchers will also have to extend the neural pipeline to other NLP tasks, such as neural coreference resolution or relation extraction, for richer text analytics.

You can check out our hands-on guide here.

Rohit Yadav
Rohit is a technology journalist and technophile who likes to communicate the latest trends around cutting-edge technologies in a way that is straightforward to assimilate. In a nutshell, he is deciphering technology. Email: rohit.yadav@analyticsindiamag.com