In most NLP libraries, the data to be processed is assumed to be in English. Although a few libraries do support other languages, they do not deliver the same results as they do for English. This is because languages vary widely from one another, and techniques that work for English may not transfer well to other languages. To address these challenges, Stanford developed a new library, Stanza, a Python-based NLP library for many human languages.
Stanza is a Python-based NLP library whose tools can be combined into a neural pipeline that converts a string of human-language text into lists of sentences and words, producing the base forms of those words, their parts of speech, and their morphological features. The toolkit is designed to support more than 70 languages, using the Universal Dependencies formalism.
Stanza is built from highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library, and GPU acceleration is supported to speed up the analysis of the various languages, including English.
In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from it, including constituency parsing, coreference resolution, and linguistic pattern matching.
The package can be installed with the pip package manager:
pip install stanza
Key Stanza features
- Native Python implementation requires minimal effort to set up
- Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition
- Pretrained neural models supporting 66 (human) languages
- A stable officially maintained Python interface to CoreNLP
System Design And Architecture
The pipeline consists of models ranging from tokenizing raw text to performing syntactic analysis on whole sentences. The design accounts for the diversity of human languages by using data-driven models that learn the differences between languages. In addition, Stanza's components are highly modular and reuse basic model architectures where possible, for compactness.
Tokenization and Sentence Splitting: Given raw text, Stanza tokenizes it and groups the tokens into sentences as the first step of processing. Unlike other existing toolkits, Stanza combines tokenization and sentence segmentation of raw text into a single module. This lets the model predict word and sentence boundaries jointly and in context, since word boundaries are context-sensitive in some languages.
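The joint approach can be illustrated as a character-tagging scheme, where each character is labeled as inside a token, ending a token, or ending both a token and a sentence. The rule-based tagger below is a self-contained toy stand-in for Stanza's neural model; the label scheme and decoding are illustrative, not Stanza's actual implementation:

```python
# Toy joint tokenizer/sentence splitter: label every character, then decode
# the labels into sentences of tokens in a single pass.
# 0 = inside a token (or whitespace), 1 = token ends here,
# 2 = token ends here and so does the sentence.

def tag_chars(text):
    labels = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch in ".!?":
            labels.append(2)   # sentence-final punctuation
        elif ch == " ":
            labels.append(0)
        elif nxt == " " or nxt in ".!?":
            labels.append(1)   # last character of a token
        else:
            labels.append(0)
    return labels

def decode(text, labels):
    sentences, sent, tok = [], [], ""
    for ch, lab in zip(text, labels):
        if ch != " ":
            tok += ch
        if lab in (1, 2) and tok:
            sent.append(tok)
            tok = ""
        if lab == 2:
            sentences.append(sent)
            sent = []
    if tok:
        sent.append(tok)
    if sent:
        sentences.append(sent)
    return sentences

text = "Hi there. Bye."
print(decode(text, tag_chars(text)))  # [['Hi', 'there', '.'], ['Bye', '.']]
```

In Stanza itself, a neural model predicts these labels from the characters and their context, which is what makes the boundaries context-sensitive.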
Multi-Word Token Expansion: The tokenizer also identifies multi-word tokens, which are then expanded into the underlying syntactic words as the foundation for downstream processing. This is accomplished with a sequence-to-sequence (seq2seq) model, combined with a lexicon of expansions observed in the training set, so that frequent expansions are always handled robustly while the model retains the flexibility to expand unseen tokens statistically.
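A toy sketch of this ensemble idea, with a lookup table standing in for the expansions seen in training and a stub in place of the neural seq2seq model (the French examples and the fallback rule are illustrative assumptions, not Stanza's actual data):

```python
# Expansions observed in training, e.g. French contractions:
# "du" -> "de le", "aux" -> "à les".
MWT_DICT = {"du": ["de", "le"], "aux": ["à", "les"]}

def seq2seq_expand(token):
    # Stand-in for the neural seq2seq expander that handles unseen tokens;
    # here it simply returns the token unchanged to stay self-contained.
    return [token]

def expand(tokens):
    words = []
    for tok in tokens:
        # Frequent expansions come from the lexicon; everything else
        # falls back to the (stubbed) seq2seq model.
        words.extend(MWT_DICT.get(tok.lower(), seq2seq_expand(tok)))
    return words

print(expand(["Il", "parle", "du", "projet"]))
# ['Il', 'parle', 'de', 'le', 'projet']
```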
POS and Morphological Feature Tagging: For each word in a sentence, Stanza assigns it a part-of-speech (POS) tag and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, among others). To predict POS tags and UFeats, the researchers adopted a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture.
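The basic shape of such a Bi-LSTM tagger is easy to sketch in PyTorch, the library Stanza is built on. The hyperparameters below are illustrative, not Stanza's actual configuration; 17 is the size of the Universal Dependencies UPOS tag inventory:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal Bi-LSTM tagger sketch: embed tokens, run a bidirectional
    LSTM, and score one tag per token from the concatenated states."""

    def __init__(self, vocab_size, n_tags, emb_dim=50, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                   # (batch, seq_len, n_tags)

tagger = BiLSTMTagger(vocab_size=100, n_tags=17)  # 17 UPOS tags
scores = tagger(torch.randint(0, 100, (1, 6)))    # one 6-token sentence
print(scores.shape)  # torch.Size([1, 6, 17])
```

At training time, a cross-entropy loss over the per-token scores would be minimized against the gold tags; Stanza's real tagger adds word, character, and pretrained embeddings, and predicts UFeats alongside POS.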
Lemmatization: Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Like the multi-word token expander, Stanza’s lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier built on the encoder output of the seq2seq model predicts shortcuts such as lowercasing and identity copy, for robustness on long input sequences such as URLs.
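The ensemble can be sketched as follows; the dictionary entries, the URL rule, and the lowercasing fallback standing in for the seq2seq model are all illustrative assumptions, not Stanza's actual components:

```python
# Dictionary component: lemmas seen in training data.
LEMMA_DICT = {"did": "do", "went": "go", "mice": "mouse"}

def seq2seq_lemmatize(word):
    # Stand-in for Stanza's neural seq2seq lemmatizer; lowercasing is a
    # placeholder so the sketch stays self-contained.
    return word.lower()

def lemmatize(word):
    # 1) Dictionary lookup handles frequent known words robustly.
    if word in LEMMA_DICT:
        return LEMMA_DICT[word]
    # 2) Shortcut classifier: identity copy for URL-like tokens, so long
    #    sequences are not mangled character by character.
    if word.startswith("http://") or word.startswith("https://"):
        return word
    # 3) Otherwise, fall back to the (stubbed) neural model.
    return seq2seq_lemmatize(word)

print(lemmatize("did"))                      # do
print(lemmatize("https://example.com/x"))    # https://example.com/x
```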
The researchers evaluated Stanza on a total of 112 datasets and found that its neural pipeline adapts well to text of different genres, achieving state-of-the-art performance at each step of the pipeline. Stanza also features a Python interface to the widely used Java CoreNLP software, giving access to richer functionality such as relation extraction and coreference resolution.
Stanza is open-source and has pre-trained models for all supported languages and datasets available for public download. Researchers hope Stanza can enable multilingual NLP research and applications, and drive new research that can produce insights from a wide range of human languages.
While Stanza supports a wide range of languages, it also connects to other NLP tooling through its CoreNLP interface. However, there are a few things the researchers still have to improve to make it a go-to NLP library for processing different languages effectively.
Firstly, the downloadable Stanza models are each trained on a single dataset; to improve robustness, the models need to be trained on data pooled from different sources. Secondly, the library is optimized for accuracy, which at times comes at the cost of computational efficiency, limiting the toolkit’s use. Finally, the researchers will also have to make it compatible with other NLP techniques, such as neural coreference resolution and relation extraction, for richer text analytics.