AllenNLP: Quick-start Guide To NLP Research Library

The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library

AllenNLP is an open-source deep-learning library for NLP. The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library. It is used for chatbot development and for the analysis of text data. AllenNLP is designed with research in mind: it supports fast prototyping of new models, and it can be used both for research publications and for industry projects.

In the previous article, on the spaCy NLP library, we covered NLP tasks such as tokenization, word embedding, and named entity recognition. Here we use those methods for text classification and label the texts using the spaCy tokenization method.

Research papers:

  1. https://arxiv.org/abs/1802.05365
  2. https://allennlp.org/papers/AllenNLP_white_paper.pdf

Github: https://github.com/allenai/allennlp

Installation:

Using conda:

 conda create -n allennlp python=3.8
 conda activate allennlp
 pip install allennlp

Using pip:

pip install allennlp

Using git:

git clone https://github.com/allenai/allennlp.git

Using Docker:

We can run AllenNLP in a Docker container; for the Dockerfile and deployment instructions, please follow the link below.

https://hub.docker.com/r/allennlp/allennlp/dockerfile
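After installing by any of these methods, a quick sanity check confirms the package is importable. This is a minimal sketch using only the Python standard library (importlib.metadata needs Python 3.8+):

 import allennlp  # raises ImportError if the installation failed
 from importlib.metadata import version
 print(version("allennlp"))  # prints the installed version string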

DatasetReader:

A DatasetReader takes raw text as input, pre-processes it (tokenization, lowercasing, lemmatization), and turns it into Instance objects with labels.

In this class, we define two functions. The first is the constructor, which sets up tokenization using SpacyTokenizer (we covered tokenizers and token indexers in the earlier blog). The second, _read, reads the text file line by line, attaches the label using the LabelField method, and yields a separate Instance for each line.

Code:

 from typing import Iterable

 from allennlp.data import DatasetReader, Instance
 from allennlp.data.fields import LabelField, TextField
 from allennlp.data.token_indexers import SingleIdTokenIndexer
 from allennlp.data.tokenizers import SpacyTokenizer

 @DatasetReader.register('classification-tsv')
 class ClassificationTsvReader(DatasetReader):
     def __init__(self):
         super().__init__()
         # tokenizer and indexers used to turn raw text into a TextField
         self.tokenizer = SpacyTokenizer()
         self.token_indexers = {'tokens': SingleIdTokenIndexer()}

     def _read(self, file_path: str) -> Iterable[Instance]:
         with open(file_path, 'r') as lines:
             for line in lines:
                 # each line holds "text<TAB>label"
                 text, label = line.strip().split('\t')
                 text_field = TextField(self.tokenizer.tokenize(text),
                                        self.token_indexers)
                 label_field = LabelField(label)
                 fields = {'text': text_field, 'label': label_field}
                 yield Instance(fields)
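With the reader defined, a minimal usage sketch looks like this; the file name train.tsv is an assumption, standing in for any tab-separated file with one text-and-label pair per line:

 reader = ClassificationTsvReader()
 for instance in reader.read("train.tsv"):  # read() wraps our _read()
     print(instance)  # shows the text and label fields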

Model:

The model does three things with the text data:

  1. Get features for each word in your input.
  2. Combine those word-level features into a single document-level feature vector.
  3. Classify that document-level vector into one of your labels.

Model Constructor:

We can use any classifier for the classification, such as a linear classifier or an RNN classifier, and we define our vocabulary, a TextFieldEmbedder, and a Seq2VecEncoder to assign a label to each instance.

This is the sequence from text to label in AllenNLP:

Text → Token IDs → Embeddings → Seq2VecEncoder → Label

Code:

 import torch

 from allennlp.data import Vocabulary
 from allennlp.models import Model
 from allennlp.modules import Seq2VecEncoder, TextFieldEmbedder

 @Model.register('simple_classifier')
 class SimpleClassifier(Model):
     def __init__(self,
                  # passing the vocabulary
                  vocab: Vocabulary,
                  # embedding words
                  embedder: TextFieldEmbedder,
                  # Seq2VecEncoder
                  encoder: Seq2VecEncoder):
         super().__init__(vocab)
         # defining the embedder
         self.embedder = embedder
         # defining the encoder
         self.encoder = encoder
         # number of labels in the vocabulary
         num_labels = vocab.get_vocab_size("labels")
         # classification layer
         self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)

We also have to define the loss and the forward method used to train the model.

The forward method is just a PyTorch function; the loss it returns is used to optimize the model.

Code:

We construct the forward method step by step.

#defining the inputs to forward()

 def forward(self,
             text: Dict[str, torch.Tensor],
             label: torch.Tensor) -> Dict[str, torch.Tensor]:

#embedding the text

 # Shape: (batch_size, num_tokens, embedding_dim)
 embedded_text = self.embedder(text) 

#applying seq2vec encoder 

 # Shape: (batch_size, num_tokens)
 mask = util.get_text_field_mask(text)
 # Shape: (batch_size, encoding_dim)
 encoded_text = self.encoder(embedded_text, mask) 

#prediction layer

 # Shape: (batch_size, num_labels)
 logits = self.classifier(encoded_text)
 # Shape: (batch_size, num_labels)
 probs = torch.nn.functional.softmax(logits, dim=-1)
 # Shape: (1,)
 loss = torch.nn.functional.cross_entropy(logits, label)
 return {'loss': loss, 'probs': probs} 

#Final model.forward()

 from typing import Dict

 import torch

 from allennlp.models import Model
 from allennlp.nn import util

 class SimpleClassifier(Model):
     def forward(self,
                 text: Dict[str, torch.Tensor],
                 label: torch.Tensor) -> Dict[str, torch.Tensor]:
         # Shape: (batch_size, num_tokens, embedding_dim)
         embedded_text = self.embedder(text)
         # Shape: (batch_size, num_tokens)
         mask = util.get_text_field_mask(text)
         # Shape: (batch_size, encoding_dim)
         encoded_text = self.encoder(embedded_text, mask)
         # Shape: (batch_size, num_labels)
         logits = self.classifier(encoded_text)
         # Shape: (batch_size, num_labels)
         probs = torch.nn.functional.softmax(logits, dim=-1)
         # Shape: (1,)
         loss = torch.nn.functional.cross_entropy(logits, label)
         return {'loss': loss, 'probs': probs}
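To see the loss computation in isolation, here is a minimal standalone sketch with made-up shapes and values (a batch of 4 examples and 2 labels); note that cross_entropy takes the raw logits, not the softmaxed probabilities:

 import torch

 logits = torch.randn(4, 2)           # (batch_size, num_labels)
 labels = torch.tensor([0, 1, 1, 0])  # gold label index for each example
 loss = torch.nn.functional.cross_entropy(logits, labels)
 print(loss)  # a scalar tensor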

Configuration files:

We have already defined our model class in code. AllenNLP can also assemble the same model, with its embedder and encoder, from a configuration file in JSON format; the build_model function below shows the equivalent construction in code.

Code:

 from allennlp.data import Vocabulary
 from allennlp.models import Model
 from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
 from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
 from allennlp.modules.token_embedders import Embedding

 def build_model(vocab: Vocabulary) -> Model:
     print("Building the model")
     vocab_size = vocab.get_vocab_size("tokens")
     embedder = BasicTextFieldEmbedder(
         {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
     encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
     return SimpleClassifier(vocab, embedder, encoder)
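Putting the pieces together, a minimal sketch builds the vocabulary from the instances produced by our reader and then builds the model (the file name train.tsv is again an assumption):

 from allennlp.data import Vocabulary

 reader = ClassificationTsvReader()
 instances = list(reader.read("train.tsv"))
 vocab = Vocabulary.from_instances(instances)  # collects tokens and labels
 model = build_model(vocab)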

The JSON configuration:

The JSON describes the model as simple_classifier, with an embedding token embedder and a bag_of_embeddings encoder, both with embedding dimension 10.

 "model": {
     "type": "simple_classifier",
     "embedder": {
         "token_embedders": {
             "tokens": {
                 "type": "embedding",
                 "embedding_dim": 10
             }
         }
     },
     "encoder": {
         "type": "bag_of_embeddings",
         "embedding_dim": 10
     }
 } 
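A complete configuration file also specifies the dataset reader, the training data path, the data loader, and the trainer. Here is a sketch of what such a file might look like; the path and hyperparameters are assumptions, and the exact keys can vary between AllenNLP versions:

 {
     "dataset_reader": {
         "type": "classification-tsv"
     },
     "train_data_path": "train.tsv",
     "model": {
         "type": "simple_classifier",
         "embedder": {
             "token_embedders": {
                 "tokens": {"type": "embedding", "embedding_dim": 10}
             }
         },
         "encoder": {
             "type": "bag_of_embeddings",
             "embedding_dim": 10
         }
     },
     "data_loader": {"batch_size": 8, "shuffle": true},
     "trainer": {"optimizer": "adam", "num_epochs": 5}
 }

Training can then be launched with AllenNLP's command-line interface, for example allennlp train my_config.json -s output_dir --include-package my_package, where the file and package names are placeholders.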

Conclusion:

We built a model using AllenNLP to classify text data into different labels. Please check the full tutorial here.

Amit Singh

Amit Singh is a Data Scientist who graduated in Computer Science and Engineering. He is a Data Science writer at Analytics India Magazine.