AllenNLP: Quick-start Guide To NLP Research Library

The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library

AllenNLP is an open-source deep-learning library for NLP. The Allen Institute for Artificial Intelligence, one of the leading AI research organizations, develops this PyTorch-based library. It is used for chatbot development and for the analysis of text data. AllenNLP is designed with research in mind: the library is open both for conducting and publishing research and for industry-based projects.

In the previous article on the SpaCy NLP library, we covered NLP tasks such as tokenization, word embedding, and named entity recognition. We use all these methods in text classification and label the texts using the SpaCy tokenization method.

Research papers:

  1. https://arxiv.org/abs/1802.05365
  2. https://allennlp.org/papers/AllenNLP_white_paper.pdf

Github: https://github.com/allenai/allennlp

Installation:

Using conda:

 conda create -n allennlp 
 conda activate allennlp
 pip install allennlp 

Using pip:

pip install allennlp

Using git:

git clone https://github.com/allenai/allennlp.git

Using docker:

We can pull the AllenNLP Docker image; for deployment details, please follow the link below.

https://hub.docker.com/r/allennlp/allennlp/dockerfile
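As a minimal sketch (assuming the public allennlp/allennlp image on Docker Hub and its latest tag), pulling the image and starting a container looks like this:

 docker pull allennlp/allennlp:latest
 docker run --rm -it allennlp/allennlp:latest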

DataSet Reader:

It takes raw text data as input and pre-processes it (tokenization, lowercasing, lemmatization, and so on) into Instance objects with labels.

In this class, we define two functions. The first is the constructor, which sets up tokenization using SpacyTokenizer (we covered the features of tokenizers and token indexers in the previous blog). The second, _read, reads the text file line by line, applies the label to each line using the LabelField method, and then creates the individual instances.

Code:

 from typing import Iterable

 from allennlp.data import DatasetReader, Instance
 from allennlp.data.fields import LabelField, TextField
 from allennlp.data.token_indexers import SingleIdTokenIndexer
 from allennlp.data.tokenizers import SpacyTokenizer

 @DatasetReader.register('classification-tsv')
 class ClassificationTsvReader(DatasetReader):
     def __init__(self):
         super().__init__()
         # SpaCy-based tokenizer and a single-id token indexer
         self.tokenizer = SpacyTokenizer()
         self.token_indexers = {'tokens': SingleIdTokenIndexer()}

     def _read(self, file_path: str) -> Iterable[Instance]:
         # Each line of the TSV file holds a text and its label
         with open(file_path, 'r') as lines:
             for line in lines:
                 text, label = line.strip().split('\t')
                 text_field = TextField(self.tokenizer.tokenize(text),
                                        self.token_indexers)
                 label_field = LabelField(label)
                 fields = {'text': text_field, 'label': label_field}
                 yield Instance(fields)
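As a quick usage sketch (train.tsv is a hypothetical tab-separated file with one text/label pair per line), we can run the reader and inspect the instances it yields:

 # Hypothetical usage: 'train.tsv' is an assumed file of text<TAB>label rows
 reader = ClassificationTsvReader()
 instances = list(reader.read('train.tsv'))
 # Each Instance holds the 'text' and 'label' fields defined above
 print(instances[0])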

Model:

The model does three things with the text data:

  1. Get some features corresponding to each word in your input
  2. Merge those word-level features into a document-level feature vector
  3. Classify that document-level feature vector into one of your labels

Model Constructor:

We can use any classifier for the classification, such as a linear classifier or an RNN classifier, and we define our Vocabulary, TextFieldEmbedder, and Seq2VecEncoder to apply a label to each instance.

This is the sequence AllenNLP follows to go from text to label:

Text → Token IDs → Embeddings → Seq2VecEncoder → Label

Code:

 import torch
 from allennlp.data import Vocabulary
 from allennlp.models import Model
 from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder

 @Model.register('simple_classifier')
 class SimpleClassifier(Model):
     def __init__(self,
                  # passing the vocabulary
                  vocab: Vocabulary,
                  # embedding words
                  embedder: TextFieldEmbedder,
                  # Seq2VecEncoder
                  encoder: Seq2VecEncoder):
         super().__init__(vocab)
         # defining embedder
         self.embedder = embedder
         # defining encoder
         self.encoder = encoder
         # number of labels in the vocabulary
         num_labels = vocab.get_vocab_size("labels")
         # classification layer
         self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
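The constructor above expects a Vocabulary with a "labels" namespace. As a minimal sketch, reusing the hypothetical instances from the reader example, we can build one directly from the data:

 from allennlp.data import Vocabulary

 # Count tokens and labels over the instances to build the vocabulary
 vocab = Vocabulary.from_instances(instances)
 # Number of distinct labels seen in the data
 print(vocab.get_vocab_size("labels"))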

We also have to define the forward method and a loss function, and later a function to train the model.

The forward method is just a standard PyTorch forward pass, and the loss it returns is used for the optimization of the model.

Code:

We are going to construct the forward method step by step.

#defining Inputs to forward()

 def forward(self,
     text: Dict[str, torch.Tensor],
     label: torch.Tensor) -> Dict[str, torch.Tensor]:

#embedding text to the model 

 # Shape: (batch_size, num_tokens, embedding_dim)
 embedded_text = self.embedder(text) 

#applying seq2vec encoder 

 # Shape: (batch_size, num_tokens)
 mask = util.get_text_field_mask(text)
 # Shape: (batch_size, encoding_dim)
 encoded_text = self.encoder(embedded_text, mask) 

#prediction layer

 # Shape: (batch_size, num_labels)
 logits = self.classifier(encoded_text)
 # Shape: (batch_size, num_labels)
 probs = torch.nn.functional.softmax(logits, dim=-1)
 # Shape: () - a scalar loss
 loss = torch.nn.functional.cross_entropy(logits, label)
 return {'loss': loss, 'probs': probs} 

#Final model.forward()

 from typing import Dict

 from allennlp.nn import util

 class SimpleClassifier(Model):
     def forward(self,
                 text: Dict[str, torch.Tensor],
                 label: torch.Tensor) -> Dict[str, torch.Tensor]:
         # Shape: (batch_size, num_tokens, embedding_dim)
         embedded_text = self.embedder(text)
         # Shape: (batch_size, num_tokens)
         mask = util.get_text_field_mask(text)
         # Shape: (batch_size, encoding_dim)
         encoded_text = self.encoder(embedded_text, mask)
         # Shape: (batch_size, num_labels)
         logits = self.classifier(encoded_text)
         # Shape: (batch_size, num_labels)
         probs = torch.nn.functional.softmax(logits, dim=-1)
         # Shape: () - a scalar loss
         loss = torch.nn.functional.cross_entropy(logits, label)
         return {'loss': loss, 'probs': probs}

Configuration files:

We have already defined our model class. To configure it, we write a build_model function that constructs the embedder and encoder in code; the same configuration can equivalently be expressed in JSON format.

Code:

 from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
 from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
 from allennlp.modules.token_embedders import Embedding

 def build_model(vocab: Vocabulary) -> Model:
     print("Building the model")
     vocab_size = vocab.get_vocab_size("tokens")
     # 10-dimensional embeddings looked up by single token ids
     embedder = BasicTextFieldEmbedder(
         {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
     # sums the word embeddings into a single document vector
     encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
     return SimpleClassifier(vocab, embedder, encoder)
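As a hedged smoke test, reusing the hypothetical instances and vocab from the earlier sketches, we can build the model and run a single forward pass on a small batch:

 from allennlp.data import Batch

 model = build_model(vocab)

 # Index a couple of instances against the vocabulary, then tensorize them
 batch_instances = instances[:2]
 for instance in batch_instances:
     instance.index_fields(vocab)
 batch = Batch(batch_instances)
 tensors = batch.as_tensor_dict(batch.get_padding_lengths())

 # forward() returns the scalar loss and per-class probabilities
 output = model(**tensors)
 print(output['loss'], output['probs'])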

The JSON equivalent:

The JSON specifies the model type as simple_classifier, the embedder as a token embedder of type embedding, and the encoder as bag_of_embeddings, both with embedding dimension 10.

 "model": {
     "type": "simple_classifier",
     "embedder": {
         "token_embedders": {
             "tokens": {
                 "type": "embedding",
                 "embedding_dim": 10
             }
         }
     },
     "encoder": {
         "type": "bag_of_embeddings",
         "embedding_dim": 10
     }
 } 
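Note that this "model" block is only a fragment of a full configuration file, which also needs dataset_reader, train_data_path, data_loader, and trainer sections. Assuming the complete configuration is saved as my_experiment.json (a hypothetical file name), training is launched from the command line:

 allennlp train my_experiment.json -s /tmp/classifier_output

The -s flag sets the serialization directory where AllenNLP stores the trained weights, vocabulary, and logs.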

Conclusion:

We built a model using AllenNLP to classify text data into different labels. Please check the full tutorial here.

Amit Singh
Amit Singh is a Data Scientist who graduated in Computer Science and Engineering. He is a Data Science writer at Analytics India Magazine.
