Natural Language Processing (NLP) is one of the most popular domains in machine learning. It is a collection of methods that enable machines to learn and understand human language. The wide adoption of its applications has made it a sought-after skill at top companies. Here are a few frequently used NLP frameworks that can handle both simple and sophisticated language modelling:
TensorFlow is one of the most widely used frameworks for a wide variety of deep learning applications. The toolkit offers customised options that make it easier to build a machine learning pipeline for NLP tasks.
To experiment with word embeddings, the vector for each source word in a batch can be looked up as follows:
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
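Conceptually, an embedding lookup just selects rows of the embedding matrix by word id. The following is a minimal, framework-free sketch of that idea, not TensorFlow's implementation; the helper `embedding_lookup` and the toy values are assumptions made for illustration.

```python
# Conceptual sketch of an embedding lookup: gather rows of a matrix by index.
# The helper name and the toy numbers are illustrative, not part of TensorFlow.

def embedding_lookup(embeddings, ids):
    """Return the embedding row for each word id in `ids`."""
    return [embeddings[i] for i in ids]

# A hypothetical vocabulary of 4 words, each with a 3-dimensional vector.
embeddings = [
    [0.1, 0.2, 0.3],  # word id 0
    [0.4, 0.5, 0.6],  # word id 1
    [0.7, 0.8, 0.9],  # word id 2
    [1.0, 1.1, 1.2],  # word id 3
]
train_inputs = [2, 0, 2]  # a batch of word ids

embed = embedding_lookup(embeddings, train_inputs)
print(embed)  # one vector per input id, in input order
```

In a real model the matrix would be a trainable variable, so the selected rows are updated during training.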
pip install tensorflow
PyTorch’s Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors.
Numpy is a great framework, but it cannot utilise GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so, unfortunately, Numpy won’t be enough for modern deep learning.
Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks
These features, along with many others, make PyTorch a suitable candidate for NLP tasks.
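To give a feel for what automatic differentiation involves, here is a minimal sketch of reverse-mode differentiation on scalars in plain Python. It is a teaching toy under simplifying assumptions (only `+` and `*` are supported), not how PyTorch is implemented; the class name `Value` is hypothetical.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # inputs this value was computed from
        self._grad_fns = grad_fns  # one local-derivative function per parent

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Order the graph so every node comes after its inputs,
        # then apply the chain rule from the output back to the leaves.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, grad_fn in zip(node._parents, node._grad_fns):
                parent.grad += grad_fn(node.grad)


x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b  # forward pass builds the graph
y.backward()   # backward pass fills in .grad on x, w and b
print(y.data, x.grad, w.grad, b.grad)
```

The forward pass records the computation graph as a side effect of ordinary arithmetic, which is the same general idea behind Tensors that "keep track of a computational graph and gradients".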
Theano is a Python library that lets you define, optimise, and evaluate mathematical expressions, especially ones with multi-dimensional arrays.
Frameworks such as Keras can run on top of Theano. Theano has been used to build models for a variety of NLP tasks like machine translation, speech recognition, word embedding, and text classification.
Keras follows best practices for reducing cognitive load: it offers consistent and simple APIs, it minimises the number of user actions required for common use cases, and it provides clear and actionable feedback upon user error. Keras integrates with lower-level deep learning libraries (in particular TensorFlow).
Here’s a sample ‘visual question-answering’ model using Keras:
The model selects the correct one-word answer when asked a natural-language question about a picture. It works by encoding the question into a vector, encoding the image into a vector, concatenating the two, and training a logistic regression on top over a vocabulary of potential answers.
from keras.layers import Conv2D, MaxPooling2D, Flatten
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model, Sequential
Check full code here
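The "concatenate, then classify over a vocabulary" step of the visual question-answering model can be sketched without any framework. The sketch below covers only the forward pass of that final step; the vectors, weights and vocabulary are hand-picked assumptions rather than learned values, and the function name `answer` is hypothetical. In the Keras model the two input vectors would come from trained question and image encoders.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def answer(question_vec, image_vec, weights, biases, vocab):
    """Fuse the two encodings and pick the most probable answer word."""
    fused = question_vec + image_vec  # list concatenation
    scores = [
        sum(w * f for w, f in zip(row, fused)) + bias
        for row, bias in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical encoder outputs: a 2-d question vector and a 3-d image vector.
question_vec = [1.0, 0.0]
image_vec = [0.0, 1.0, 0.5]

# One weight row per candidate answer, over the 5 fused features.
vocab = ["cat", "dog", "car"]
weights = [
    [0.2, 0.1, 0.0, 1.5, 0.4],
    [0.5, 0.3, 0.2, 0.1, 0.2],
    [-0.3, 0.8, 0.1, 0.0, 0.6],
]
biases = [0.0, 0.0, 0.0]

word, probs = answer(question_vec, image_vec, weights, biases, vocab)
print(word, probs)
```

Training would adjust `weights` and `biases` (and the encoders) so that the correct answer receives the highest probability.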
Chainer is a Python-based, standalone open-source framework for Deep Learning models. Chainer provides a flexible, intuitive, and high-performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational autoencoders.
A Recurrent Neural Net Language Model (RNNLM) is a type of neural language model that contains RNNs in the network. Since an RNN can deal with variable-length inputs, it is suitable for modelling sequential data such as sentences in natural language.
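The variable-length property is easy to see in a framework-free sketch: the same weights are applied at every step, so the loop simply runs for as many tokens as the sentence has. The code below is a minimal Elman-style recurrence with made-up toy parameters, covering only the recurrent part; a full language model would add an output layer that predicts the next word. The name `rnn_encode` and all values are assumptions for illustration.

```python
import math

def rnn_encode(token_ids, embeddings, w_in, w_rec):
    """Run a minimal Elman-style RNN and return the final hidden state."""
    n_units = len(w_rec)
    hidden = [0.0] * n_units  # initial state
    for token in token_ids:   # one step per token, whatever the length
        x = embeddings[token]
        hidden = [
            math.tanh(
                sum(w_in[j][k] * x[k] for k in range(len(x)))
                + sum(w_rec[j][k] * hidden[k] for k in range(n_units))
            )
            for j in range(n_units)
        ]
    return hidden

# Toy parameters: 3 word ids, 2-d embeddings, 2 hidden units.
embeddings = [[0.5, -0.1], [0.3, 0.8], [-0.6, 0.2]]
w_in = [[0.4, -0.2], [0.1, 0.7]]
w_rec = [[0.3, 0.0], [-0.5, 0.2]]

short_state = rnn_encode([0, 1], embeddings, w_in, w_rec)
long_state = rnn_encode([2, 1, 0, 1, 2], embeddings, w_in, w_rec)
print(len(short_state), len(long_state))  # same size for both sentences
```

Both sentences end up as a hidden state of the same size, which is what lets later layers treat inputs of different lengths uniformly.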
An RNNLM written in Chainer is shown below:
def __init__(self, n_vocab, n_units):
Stanford CoreNLP provides a set of human language technology tools. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible.
Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools.
Advantages can be summarised as follows:
- An integrated NLP toolkit with a broad range of grammatical analysis tools
- A fast, robust annotator for arbitrary texts, widely used in production
- A modern, regularly updated package, with the overall highest quality text analytics
- Support for a number of major (human) languages
- Available APIs for most major modern programming languages
- Ability to run as a simple web service
TFLearn is a modular and transparent deep learning library built on top of TensorFlow. It was designed to provide a higher-level API to TensorFlow in order to facilitate and speed up experimentation while remaining fully transparent and compatible with it.
The high-level API currently supports most recent deep learning models, such as convolutions, LSTM, BiRNN, BatchNorm, PReLU, residual networks, and generative networks.
pip install tflearn
The main objective of all these frameworks is to make deep learning algorithms easier to work with. While TensorFlow, Keras and PyTorch have been making the most noise in the NLP community, the other alternatives either bring their own set of advantages or work alongside other frameworks for a better overall experience.
For further reading, check this post by Olga Davydova