AllenNLP is an open-source deep-learning library for NLP. It is developed by the Allen Institute for Artificial Intelligence (AI2), one of the leading AI research organizations, and is built on top of PyTorch. The library is used for tasks such as chatbot development and text analysis, and it is designed with research in mind: it supports both conducting and publishing research and building industry projects.
In the previous article on the spaCy NLP library, we covered NLP tasks such as tokenization, word embeddings, and named entity recognition. In this article, we use those building blocks for text classification, labelling the texts after tokenizing them with spaCy.
Research paper:
GitHub: https://github.com/allenai/allennlp
Installation:
Using conda:
conda create -n allennlp
conda activate allennlp
pip install allennlp
Using pip:
pip install allennlp
Using git:
git clone https://github.com/allenai/allennlp.git
Using docker:
We can also pull the prebuilt Docker image; for deployment details, please follow the link below.
https://hub.docker.com/r/allennlp/allennlp/dockerfile
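If Docker is installed, the image can be pulled directly in the same way as the other install commands above (pulling the default latest tag is an assumption; see the Docker Hub page for the available tags):
docker pull allennlp/allennlp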
DataSet Reader:
It takes raw text data as input, pre-processes it (for example tokenization, lowercasing, and lemmatization), and turns it into labelled Instance objects.
In this class we define two methods. The first sets up tokenization with SpacyTokenizer and a SingleIdTokenIndexer (tokenizers and token indexers were covered in the previous blog). The second, _read, reads the text file line by line, wraps each label in a LabelField, and yields one Instance per line.
Code:
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import SpacyTokenizer


@DatasetReader.register('classification-tsv')
class ClassificationTsvReader(DatasetReader):
    def __init__(self):
        super().__init__()
        # spaCy-based tokenizer and a single-id token indexer
        self.tokenizer = SpacyTokenizer()
        self.token_indexers = {'tokens': SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # each line of the TSV file is "<text>\t<label>"
        with open(file_path, 'r') as lines:
            for line in lines:
                text, label = line.strip().split('\t')
                text_field = TextField(self.tokenizer.tokenize(text),
                                       self.token_indexers)
                label_field = LabelField(label)
                fields = {'text': text_field, 'label': label_field}
                yield Instance(fields)
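To sanity-check the reader, we can run it over a small file. This is a minimal sketch: 'train.tsv' is a hypothetical file whose lines each contain a text and a label separated by a tab.

# 'train.tsv' is a hypothetical file of "<text>\t<label>" lines
reader = ClassificationTsvReader()
instances = list(reader.read('train.tsv'))
# each Instance holds a 'text' TextField and a 'label' LabelField
print(instances[0])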
Model:
The model does three things with the text data:
- Extract a feature vector for each word (token) in the input
- Combine those word-level features into a single document-level feature vector
- Classify that document-level vector into one of the labels
We can use any classifier for this step, such as a linear classifier or an RNN classifier, and we define a Vocabulary, a TextFieldEmbedder, and a Seq2VecEncoder to map each instance to a label.
This is the sequence of steps that takes a text to a label in AllenNLP:
Text → Token IDs → Embeddings → Seq2VecEncoder → Label
Code:
import torch
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder


@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 # the vocabulary built from the data
                 vocab: Vocabulary,
                 # embeds the words
                 embedder: TextFieldEmbedder,
                 # Seq2VecEncoder that pools token vectors into one vector
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        # number of labels in the vocabulary
        num_labels = vocab.get_vocab_size("labels")
        # classification layer
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
We also have to define the forward method, which computes the loss used to train the model.
The forward method is an ordinary PyTorch forward pass; the loss it returns is what the trainer optimizes.
Code:
We are going to build forward() step by step.
# inputs to forward()
def forward(self,
            text: Dict[str, torch.Tensor],
            label: torch.Tensor) -> Dict[str, torch.Tensor]:

    # embed the text
    # Shape: (batch_size, num_tokens, embedding_dim)
    embedded_text = self.embedder(text)

    # apply the Seq2VecEncoder with a padding mask
    # Shape: (batch_size, num_tokens)
    mask = util.get_text_field_mask(text)
    # Shape: (batch_size, encoding_dim)
    encoded_text = self.encoder(embedded_text, mask)

    # prediction layer
    # Shape: (batch_size, num_labels)
    logits = self.classifier(encoded_text)
    # Shape: (batch_size, num_labels)
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Shape: (1,)
    loss = torch.nn.functional.cross_entropy(logits, label)
    return {'loss': loss, 'probs': probs}
#Final model.forward()
from typing import Dict

from allennlp.nn import util


class SimpleClassifier(Model):
    def forward(self,
                text: Dict[str, torch.Tensor],
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: (1,)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
Configuration files:
We have already defined the model class; now we write a build_model function that wires the vocabulary, embedder, and encoder together. The same embedder and encoder setup can also be expressed as a JSON configuration.
Code:
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding


def build_model(vocab: Vocabulary) -> Model:
    print("Building the model")
    vocab_size = vocab.get_vocab_size("tokens")
    # embed each token id with a 10-dimensional embedding
    embedder = BasicTextFieldEmbedder(
        {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
    # pool the token embeddings into a single bag-of-embeddings vector
    encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
    return SimpleClassifier(vocab, embedder, encoder)
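To see how the reader, vocabulary, and model fit together, they can be wired up roughly as follows. This is a hedged sketch: 'train.tsv' is a hypothetical data file, and the batching step assumes AllenNLP 2.x's SimpleDataLoader.

from allennlp.data import Vocabulary
from allennlp.data.data_loaders import SimpleDataLoader

# read the (hypothetical) training file and build the vocabulary from it
reader = ClassificationTsvReader()
instances = list(reader.read('train.tsv'))
vocab = Vocabulary.from_instances(instances)
model = build_model(vocab)

# index the instances with the vocabulary, batch them, and run one forward pass
data_loader = SimpleDataLoader(instances, batch_size=8, shuffle=False)
data_loader.index_with(vocab)
batch = next(iter(data_loader))
outputs = model(**batch)
print(outputs['loss'], outputs['probs'].shape)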
The equivalent JSON configuration:
The JSON describes the model as simple_classifier, the embedder as a token embedder of type embedding, and the encoder as bag_of_embeddings, each with an embedding dimension of 10.
"model": { "type": "simple_classifier", "embedder": { "token_embedders": { "tokens": { "type": "embedding", "embedding_dim": 10 } } }, "encoder": { "type": "bag_of_embeddings", "embedding_dim": 10 } }
Conclusion:
We built a model using AllenNLP to classify text data into different labels. Please check the full tutorial here