AllenNLP: Quick-start Guide To NLP Research Library

The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library

AllenNLP is an open-source deep-learning library for NLP. The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library. It is used for chatbot development and for the analysis of text data. AllenNLP is designed with research in mind: the library is open both for conducting and publishing research and for industry-based projects.

In the previous article on the SpaCy NLP library, we covered NLP tasks such as tokenization, word embedding, and named entity recognition. We use all these methods in text classification and label the texts using the SpaCy tokenization method.

Research papers:

  1. https://arxiv.org/abs/1802.05365
  2. https://allennlp.org/papers/AllenNLP_white_paper.pdf

Github: https://github.com/allenai/allennlp

Installation:

Using conda:

 conda create -n allennlp 
 conda activate allennlp
 pip install allennlp 

Using pip:

pip install allennlp

Using git:

git clone https://github.com/allenai/allennlp.git

Using docker:

We can pull the AllenNLP Docker image; for deployment details, please follow the link below.

https://hub.docker.com/r/allennlp/allennlp/dockerfile
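As a minimal sketch (assuming the public allennlp/allennlp image on Docker Hub and its latest tag), pulling the image and starting a container looks like this:

 docker pull allennlp/allennlp:latest
 docker run --rm -it allennlp/allennlp:latest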

DataSet Reader:

It takes raw text data as input and pre-processes it (tokenization, lowercasing, lemmatization, and so on) into Instance objects with labels.

In this class, we define two functions. The first is the constructor, which sets up tokenization using SpacyTokenizer (we covered the features of tokenizers and token indexers in the previous blog). The second, _read, reads the text file line by line, applies the label to each line using the LabelField method, and then creates the individual instances.

Code:

 from typing import Iterable

 from allennlp.data import DatasetReader, Instance
 from allennlp.data.fields import LabelField, TextField
 from allennlp.data.token_indexers import SingleIdTokenIndexer
 from allennlp.data.tokenizers import SpacyTokenizer

 @DatasetReader.register('classification-tsv')
 class ClassificationTsvReader(DatasetReader):
     def __init__(self):
         super().__init__()
         # SpaCy-based tokenizer and a single-id token indexer
         self.tokenizer = SpacyTokenizer()
         self.token_indexers = {'tokens': SingleIdTokenIndexer()}

     def _read(self, file_path: str) -> Iterable[Instance]:
         # Each line of the TSV file holds a text and its label
         with open(file_path, 'r') as lines:
             for line in lines:
                 text, label = line.strip().split('\t')
                 text_field = TextField(self.tokenizer.tokenize(text),
                                        self.token_indexers)
                 label_field = LabelField(label)
                 fields = {'text': text_field, 'label': label_field}
                 yield Instance(fields)
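As a quick usage sketch (train.tsv is a hypothetical tab-separated file with one text/label pair per line), we can run the reader and inspect the instances it yields:

 # Hypothetical usage: 'train.tsv' is an assumed file of text<TAB>label rows
 reader = ClassificationTsvReader()
 instances = list(reader.read('train.tsv'))
 # Each Instance holds the 'text' and 'label' fields defined above
 print(instances[0])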

Model:

The model does three things with the text data:

  1. Get some features corresponding to each word in your input
  2. Merge those word-level features into a document-level feature vector
  3. Classify that document-level feature vector into one of your labels

Model Constructor:

We can use any classifier for the classification, such as a linear classifier or an RNN classifier, and we define our Vocabulary, TextFieldEmbedder, and Seq2VecEncoder to apply a label to each instance.

This is the sequence AllenNLP follows to go from text to label:

Text → Token IDs → Embeddings → Seq2VecEncoder → Label

Code:

 import torch
 from allennlp.data import Vocabulary
 from allennlp.models import Model
 from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder

 @Model.register('simple_classifier')
 class SimpleClassifier(Model):
     def __init__(self,
                  # passing the vocabulary
                  vocab: Vocabulary,
                  # embedding words
                  embedder: TextFieldEmbedder,
                  # Seq2VecEncoder
                  encoder: Seq2VecEncoder):
         super().__init__(vocab)
         # defining embedder
         self.embedder = embedder
         # defining encoder
         self.encoder = encoder
         # number of labels in the vocabulary
         num_labels = vocab.get_vocab_size("labels")
         # classification layer
         self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
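The constructor above expects a Vocabulary with a "labels" namespace. As a minimal sketch, reusing the hypothetical instances from the reader example, we can build one directly from the data:

 from allennlp.data import Vocabulary

 # Count tokens and labels over the instances to build the vocabulary
 vocab = Vocabulary.from_instances(instances)
 # Number of distinct labels seen in the data
 print(vocab.get_vocab_size("labels"))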

We also have to define the forward method and a loss function, and later a function to train the model.

The forward method is just a standard PyTorch forward pass, and the loss it returns is used for the optimization of the model.

Code:

We are going to construct the forward method step by step.

#defining Inputs to forward()

 def forward(self,
     text: Dict[str, torch.Tensor],
     label: torch.Tensor) -> Dict[str, torch.Tensor]:

#embedding text to the model 

 # Shape: (batch_size, num_tokens, embedding_dim)
 embedded_text = self.embedder(text) 

#applying seq2vec encoder 

 # Shape: (batch_size, num_tokens)
 mask = util.get_text_field_mask(text)
 # Shape: (batch_size, encoding_dim)
 encoded_text = self.encoder(embedded_text, mask) 

#prediction layer

 # Shape: (batch_size, num_labels)
 logits = self.classifier(encoded_text)
 # Shape: (batch_size, num_labels)
 probs = torch.nn.functional.softmax(logits, dim=-1)
 # Shape: () - a scalar loss
 loss = torch.nn.functional.cross_entropy(logits, label)
 return {'loss': loss, 'probs': probs} 

#Final model.forward()

 from typing import Dict

 from allennlp.nn import util

 class SimpleClassifier(Model):
     def forward(self,
                 text: Dict[str, torch.Tensor],
                 label: torch.Tensor) -> Dict[str, torch.Tensor]:
         # Shape: (batch_size, num_tokens, embedding_dim)
         embedded_text = self.embedder(text)
         # Shape: (batch_size, num_tokens)
         mask = util.get_text_field_mask(text)
         # Shape: (batch_size, encoding_dim)
         encoded_text = self.encoder(embedded_text, mask)
         # Shape: (batch_size, num_labels)
         logits = self.classifier(encoded_text)
         # Shape: (batch_size, num_labels)
         probs = torch.nn.functional.softmax(logits, dim=-1)
         # Shape: () - a scalar loss
         loss = torch.nn.functional.cross_entropy(logits, label)
         return {'loss': loss, 'probs': probs}

Configuration files:

We have already defined our model class. To configure it, we write a build_model function that constructs the embedder and encoder in code; the same configuration can equivalently be expressed in JSON format.

Code:

 from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
 from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
 from allennlp.modules.token_embedders import Embedding

 def build_model(vocab: Vocabulary) -> Model:
     print("Building the model")
     vocab_size = vocab.get_vocab_size("tokens")
     # 10-dimensional embeddings looked up by single token ids
     embedder = BasicTextFieldEmbedder(
         {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
     # sums the word embeddings into a single document vector
     encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
     return SimpleClassifier(vocab, embedder, encoder)
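As a hedged smoke test, reusing the hypothetical instances and vocab from the earlier sketches, we can build the model and run a single forward pass on a small batch:

 from allennlp.data import Batch

 model = build_model(vocab)

 # Index a couple of instances against the vocabulary, then tensorize them
 batch_instances = instances[:2]
 for instance in batch_instances:
     instance.index_fields(vocab)
 batch = Batch(batch_instances)
 tensors = batch.as_tensor_dict(batch.get_padding_lengths())

 # forward() returns the scalar loss and per-class probabilities
 output = model(**tensors)
 print(output['loss'], output['probs'])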

The JSON equivalent:

The JSON specifies the model type as simple_classifier, the embedder as a token embedder of type embedding, and the encoder as bag_of_embeddings, both with embedding dimension 10.

 "model": {
     "type": "simple_classifier",
     "embedder": {
         "token_embedders": {
             "tokens": {
                 "type": "embedding",
                 "embedding_dim": 10
             }
         }
     },
     "encoder": {
         "type": "bag_of_embeddings",
         "embedding_dim": 10
     }
 } 
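Note that this "model" block is only a fragment of a full configuration file, which also needs dataset_reader, train_data_path, data_loader, and trainer sections. Assuming the complete configuration is saved as my_experiment.json (a hypothetical file name), training is launched from the command line:

 allennlp train my_experiment.json -s /tmp/classifier_output

The -s flag sets the serialization directory where AllenNLP stores the trained weights, vocabulary, and logs.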

Conclusion:

We built a model using AllenNLP to classify text data into different labels. Please check the full tutorial here.

Amit Singh
Amit Singh is a Data Scientist who graduated in Computer Science and Engineering. He is a Data Science writer at Analytics India Magazine.
