
Guide To Question-Answering System With T5 Transformer

Question Answering is a classical Natural Language Processing task: a system is asked a question over a set of documents or text and should be able to answer it.

As history has proven, computer science has always helped make our lives easier. With the assistance of web search engines, we can get almost any information at our fingertips, and the field keeps pushing to make computers act more intelligently. However, one of the key barriers faced by researchers and pioneers remains Natural Language Understanding.

Question Answering is a classical Natural Language Processing task: given a question and a set of documents or text, a system should be able to produce the answer. We have seen this application countless times in chatbots on websites, some of which are so good that you often cannot tell whether you are talking to a bot or a person.

There are numerous examples of trivia bots that act as quizzing opponents; trivia is essentially a general-knowledge question-answering test. A related benchmark is the Turing test, which places a chatbot on one side and a human on the other. The human does not know who is on the other side and must decide whether it is a chatbot or another person. If the human judge fails to do so and a chatbot was in fact at work on the other side, the chatbot passes the test.

Question Answering also depends heavily on having a good corpus or dataset. Larger collections generally lead to better question-answering performance, unless the question domain is orthogonal to, or quite different from, the corpus. Some may think that building a corpus from the web creates a data redundancy problem, but redundancy can actually help: nuggets (small clusters of similar information) are likely to appear in various phrasings and contexts, which greatly benefits answer finding.

Here in this article, we'll build a Question-Answering system using the T5 Transformer, a state-of-the-art text-to-text transformer developed by Google AI. T5 comes pre-trained on the C4 dataset (Colossal Clean Crawled Corpus), roughly 750 gigabytes of cleaned web text. You may read about the T5 transformer here in one of my articles.
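
To make the text-to-text idea concrete, here is a minimal sketch (the strings below are illustrative, not taken from this tutorial's pipeline) of how T5 casts every task, question answering included, as plain text in and text out:

 # a minimal sketch of T5's text-to-text framing; the strings are illustrative
 qa_input = "question: What is the capital of France? context: Paris is the capital and largest city of France."
 qa_target = "Paris"
 translation_input = "translate English to German: The house is wonderful."
 # one seq2seq model maps each input string to its output string
 print(qa_input, "->", qa_target)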

Let’s dive into the code implementation for making a Q/A system. This will be implemented in a Google Colab notebook whose link will be given below.

Code Implementation of Question Answering with T5 Transformer

Importing Libraries and Dependencies

Make sure the GPU runtime is enabled before you run any cells; enabling it later restarts the runtime and you will have to rerun everything. If it is not enabled, follow this:

Runtime -> Change runtime type -> Hardware accelerator -> GPU.

 # check for the GPU provided in the runtime
 !nvidia-smi
 # using --quiet to control the log output
 # and suppress install noise in the notebook
 !pip install --quiet transformers==4.1.1
 # pytorch lightning for smoother model training and data loading
 !pip install --quiet https://github.com/PyTorchLightning/pytorch-lightning/releases/download/1.2.6/pytorch-lightning-1.2.6.tar.gz 
 # using HuggingFace tokenizers
 !pip install --quiet tokenizers==0.9.4
 # Google's sentencepiece
 !pip install --quiet sentencepiece==0.1.94
 # mostly pl is used while doing complex model training
 import pytorch_lightning as pl
 print(pl.__version__)
 # argparse makes it easier to write user friendly command line interfaces
 import argparse
 # glob for Unix-style pathname pattern matching
 import glob
 # making directories for data
 import os
 # reading json files as the data is present in json files
 import json
 # time module for calculating the model runtime
 import time
 # Allows writing status messages to a file
 import logging
 # generate random float numbers uniformly
 import random
 # regex module for text 
 import re
 # module provides various functions which work on
 # iterators to produce complex iterators
 from itertools import chain
 from string import punctuation
 # pandas for data manipulation
 import pandas as pd
 # numpy for array operations
 import numpy as np
 # PyTorch
 import torch
 # provides various classes representing file system paths
 # with appropriate semantics
 from pathlib import Path
 from torch.utils.data import Dataset, DataLoader
 # splitting the data 
 from sklearn.model_selection import train_test_split
 # ANSI color formatting for output in the terminal
 from termcolor import colored
 # wrapping paragraphs into string
 import textwrap
 # model checkpoints in pretrained model
 from pytorch_lightning.callbacks import ModelCheckpoint
 '''
 AdamW - optimizer with decoupled weight decay
 T5ForConditionalGeneration - the seq2seq T5 model we will fine-tune
 T5Tokenizer - T5's tokenizer, which is fast
 get_linear_schedule_with_warmup - linear learning-rate schedule with warmup
 '''
 from transformers import (
     AdamW,
     T5ForConditionalGeneration,
     T5Tokenizer,
     get_linear_schedule_with_warmup
 )
 # seed Python, NumPy and PyTorch RNGs for reproducibility
 pl.seed_everything(0)
Downloading the Dataset
 # QA dataset from https://github.com/dmis-lab/bioasq-biobert
 # which is in Zip format
 !gdown --id 1mxVUywvKzvA9bvrUc11RYuOTy7MYcXHF
 # Unzipping the folder
 !unzip -q bio-QA.zip
 # let's have a look at one of the files
 with Path("BioASQ/BioASQ-train-factoid-4b.json").open() as json_file:
   data = json.load(json_file)     
 # Data is a dictionary
 data.keys()
 # let's have a look at how the data is stored and in what format
 data['version']
 # len of each file
 len(data['data'])
 # We have a list of dictionaries in the "data". We can explore the 0th element
 data['data'][0].keys()
 data['data'][0]['title']
 len(data['data'][0]['paragraphs'])
 questions = data['data'][0]['paragraphs']
 # datapoint sample
 questions[0] 
Function to Create Pandas Dataframes of Questions and Answers

This function reads the data from the folder containing multiple JSON files and loads each file into a dataframe so we can manipulate it and proceed further.

 def extract_questions_and_answers(factoid_path: Path):
   with factoid_path.open() as json_file:
     data = json.load(json_file)
     questions = data['data'][0]['paragraphs']
     data_rows = []
     for question in questions:
       context = question['context']
       for question_and_answers in question['qas']:
         question = question_and_answers['question']
         answers = question_and_answers['answers']
         for answer in answers:
           answer_text = answer['text']
           answer_start = answer['answer_start']
           answer_end = answer['answer_start'] + len(answer_text)  #Gets the end index of each answer in the paragraph
           data_rows.append({
                 "question" : question,
                 "context"  : context,
                 "answer_text" : answer_text,
                 "answer_start" : answer_start,
                 "answer_end" : answer_end
             })
     return pd.DataFrame(data_rows)
 factoid_path = Path("BioASQ/BioASQ-train-factoid-4b.json")
 extract_questions_and_answers(factoid_path).head()      
 factoid_paths = sorted(list(Path('BioASQ/').glob('BioASQ-train-*')))
 factoid_paths
 dfs = []
 for factoid_path in factoid_paths:
   df = extract_questions_and_answers(factoid_path)
   dfs.append(df)
 df = pd.concat(dfs)
 dfs = []
 df.head()
 df.shape
 # Dropping rows with duplicated contexts, keeping the first question for each
 df = df.drop_duplicates(subset=["context"]).reset_index(drop=True)
 df.shape
 len(df.question.unique())
 len(df.context.unique())
 sample_question = df.iloc[243]
 sample_question
 # Using termcolor to visualize the answer within the context
 def color_answer(question):
   answer_start, answer_end = question["answer_start"], question["answer_end"]
   context = question['context']
   return  colored(context[:answer_start], "white") + \
     colored(context[answer_start:answer_end], "green") + \
     colored(context[answer_end:], "white")
 print(sample_question['question'])
 print()
 print("Answer: ")
 for wrap in textwrap.wrap(color_answer(sample_question), width = 100):
   print(wrap) 
Tokenization

In the following cells, we instantiate the tokenizer for our model. The T5 tokenizer is quite fast compared to other BERT-style tokenizers. We will run it on the sample text below and then decode the result.

 # using the base T5 model having 222M params
 MODEL_NAME ='t5-base' 
 tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
 sample_encoding = tokenizer('is the glass half empty or half full?', 'It depends on the initial state of the glass. If the glass starts out empty and liquid is added until it is half full, it is half full. If the glass starts out full and liquid is removed until it is half empty, it is half empty.')
 sample_encoding.keys()
 print(sample_encoding["input_ids"])
 print(sample_encoding["attention_mask"])
 print(len(sample_encoding['input_ids']), len(sample_encoding['attention_mask']))
 # Checking the decoding of the input ids
 preds = [
          tokenizer.decode(input_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
          for input_id in sample_encoding['input_ids']
 ]
 preds= " ".join(preds)
 for wrap in textwrap.wrap(preds, width = 80):
   print(wrap)
 encoding = tokenizer(
     sample_question['question'],
     sample_question['context'],
     max_length=396,
     padding='max_length',
     truncation="only_second",
     return_attention_mask=True,
     add_special_tokens=True,
     return_tensors="pt"
 )
 encoding.keys()
 tokenizer.special_tokens_map
 tokenizer.eos_token, tokenizer.eos_token_id
 # An input id of 1 is T5's end-of-sequence (</s>) token.
 # Text representation of the input ids
 tokenizer.decode(encoding['input_ids'].squeeze()) 
Creating labels for the answers

In the following cells, we create the labels for the answers: the tokenized answer ids become the targets, and the padding token ids (0) are replaced with -100 so the loss ignores them. This lets the model learn to generate answers conditioned on the questions.

 answer_encoding = tokenizer(
     sample_question['answer_text'],
     max_length=32,
     padding='max_length',
     truncation=True,
     return_attention_mask=True,
     add_special_tokens=True,
     return_tensors="pt"
 )
 tokenizer.decode(answer_encoding['input_ids'].squeeze())
 labels = answer_encoding["input_ids"]
 labels
 labels[labels == 0] = -100
 labels 
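To see why the padding ids are set to -100: T5 computes a cross-entropy loss over its vocabulary logits, and PyTorch's CrossEntropyLoss skips any position whose target equals ignore_index, which defaults to -100. A small self-contained sketch (the tensors below are illustrative):

 # illustrative tensors showing how -100 masks padding positions
 import torch.nn as nn
 logits = torch.randn(4, 10)                  # 4 positions, vocabulary of 10
 targets = torch.tensor([3, 7, -100, -100])   # last two positions are padding
 loss_fn = nn.CrossEntropyLoss()              # ignore_index=-100 by default
 print(loss_fn(logits, targets))              # loss over the first two positions only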
Create dataset

In the following cells, we create the dataset class that feeds the model. This requires setting the maximum sequence lengths, the padding strategy and the tokenizer. The data comes from the BioASQ biomedical QA dataset, which suits this task well.

 class BioQADataset(Dataset):
   def __init__(
       self,
       data:pd.DataFrame,
       tokenizer:T5Tokenizer,
       source_max_token_len: int = 396,
       target_max_token_len: int = 32,
       ):
     self.data =  data
     self.tokenizer =  tokenizer
     self.source_max_token_len =  source_max_token_len
     self.target_max_token_len =  target_max_token_len
   def __len__(self):
     return len(self.data)
   def __getitem__(self, index: int):
     data_row = self.data.iloc[index]
    source_encoding = self.tokenizer(
       data_row['question'],
       data_row['context'],
       max_length=self.source_max_token_len,
       padding='max_length',
       truncation="only_second",
       return_attention_mask=True,
       add_special_tokens=True,
       return_tensors="pt"
       )
    target_encoding = self.tokenizer(
       data_row['answer_text'],
       max_length=self.target_max_token_len,
       padding='max_length',
       truncation=True,
       return_attention_mask=True,
       add_special_tokens=True,
       return_tensors="pt"
       )
     labels = target_encoding['input_ids']
     labels[labels==0] = -100
     return dict(
         question=data_row['question'],
         context=data_row['context'],
         answer_text=data_row['answer_text'],
         input_ids=source_encoding["input_ids"].flatten(),
         attention_mask=source_encoding['attention_mask'].flatten(),
         labels=labels.flatten()
     )
 sample_dataset = BioQADataset(df, tokenizer)
 for data in sample_dataset:
   print("Question: ", data['question'])
   print("Answer text: ", data['answer_text'])
   print("Input_ids: ", data['input_ids'][:10])
   print("Labels: ", data['labels'][:10])
   break 
Split Dataset into Train and Validation Sets

In the following cells, we split the data into two parts. The validation split is kept small (5%) because this heavy model needs as much data as possible for training.

 train_df, val_df = train_test_split(df, test_size=0.05)
 train_df.shape,  val_df.shape 
Create the PyTorch Lightning Data Module

In the following cells, we create the LightningDataModule, which bundles the train, validation and test dataloaders in one place. PyTorch Lightning introduced it to smooth out data handling for complex models.

 class BioDataModule(pl.LightningDataModule):
   def __init__(
       self,
       train_df: pd.DataFrame,
       test_df: pd.DataFrame,
       tokenizer:T5Tokenizer,
       batch_size: int = 8,
       source_max_token_len: int = 396,
       target_max_token_len: int = 32,
       ):
     super().__init__()
     self.train_df = train_df
     self.test_df = test_df
     self.tokenizer = tokenizer
     self.batch_size = batch_size
     self.source_max_token_len = source_max_token_len
     self.target_max_token_len = target_max_token_len
  def setup(self, stage=None):
     self.train_dataset = BioQADataset(
         self.train_df,
         self.tokenizer,
         self.source_max_token_len,
         self.target_max_token_len
         )
     self.test_dataset = BioQADataset(
     self.test_df,
     self.tokenizer,
     self.source_max_token_len,
     self.target_max_token_len
     )
   def train_dataloader(self):
     return DataLoader(
         self.train_dataset,
         batch_size=self.batch_size,
         shuffle=True,
         num_workers=4
         )
   def val_dataloader(self):
     return DataLoader(
         self.test_dataset,
         batch_size=self.batch_size,
         num_workers=4
         )
   def test_dataloader(self):
     return DataLoader(
         self.test_dataset,
         batch_size=1,
         num_workers=4
         )
 BATCH_SIZE = 4
 N_EPOCHS = 6
 data_module = BioDataModule(train_df, val_df, tokenizer, batch_size=BATCH_SIZE)
 data_module.setup() 
Loading and Fine Tuning T5

In the following cells, we load the pretrained model and, before fine-tuning it for our application, sanity-check two of T5's built-in tasks: translation and summarization.

 model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict = True)
 model.config
 # To check the translation from English to German built-in task 
 input_ids_translated = tokenizer(
     "translate English to German : Oppertunity did not knock until I built a door",
     return_tensors = 'pt'
 ).input_ids
 generated_ids = model.generate(input_ids = input_ids_translated)
 generated_ids
 pred_translated = [
          tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
          for gen_id in generated_ids
 ]
 pred_translated
 "".join(pred_translated)
 # T5's built-in summarization task: prefix the passage with "summarize:"
 # (the sample passage here is an illustrative stand-in)
 text = ("summarize: Machine learning studies computer algorithms that improve "
         "automatically through experience. It is seen as a subset of "
         "artificial intelligence and is used across science and industry.")
 input_ids_summary = tokenizer(
     text,
     return_tensors = 'pt'
 ).input_ids
 generated_ids_summary = model.generate(input_ids = input_ids_summary)
 generated_ids_summary
 pred_summary = [
          tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
          for gen_id in generated_ids_summary
 ]
 " ".join(pred_summary)
 # Model config
 model.config
Building the PyTorch lightning module using T5ForConditionalGeneration model

In the following cells, we leverage T5ForConditionalGeneration, which generates text conditioned on the input; for our application task (QA), the condition is the question together with its context.

 class BioQAModel(pl.LightningModule):
   def __init__(self):
     super().__init__()
     self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)
   def forward(self, input_ids, attention_mask, labels=None):
     output = self.model(
         input_ids, 
         attention_mask=attention_mask,
         labels=labels)
     return output.loss, output.logits
   def training_step(self, batch, batch_idx):
     input_ids = batch['input_ids']
     attention_mask=batch['attention_mask']
     labels = batch['labels']
     loss, outputs = self(input_ids, attention_mask, labels)
     self.log("train_loss", loss, prog_bar=True, logger=True)
     return {"loss": loss, "predictions":outputs, "labels": labels}
   def validation_step(self, batch, batch_idx):
     input_ids = batch['input_ids']
     attention_mask=batch['attention_mask']
     labels = batch['labels']
     loss, outputs = self(input_ids, attention_mask, labels)
     self.log("val_loss", loss, prog_bar=True, logger=True)
     return loss
   def test_step(self, batch, batch_idx):
     input_ids = batch['input_ids']
     attention_mask=batch['attention_mask']
     labels = batch['labels']
     loss, outputs = self(input_ids, attention_mask, labels)
     self.log("test_loss", loss, prog_bar=True, logger=True)
     return loss
   def configure_optimizers(self):
     optimizer = AdamW(self.parameters(), lr=0.0001)
     return optimizer
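Note that get_linear_schedule_with_warmup was imported but configure_optimizers above sticks to a fixed learning rate. If you want warmup, a variant could look like the sketch below (a sketch only; the total step count is an assumption you would compute from your dataloader length and epoch count):

 # optional: configure_optimizers with linear warmup (a sketch, not the
 # article's training setup); num_training_steps is an assumed placeholder
 def configure_optimizers(self):
     optimizer = AdamW(self.parameters(), lr=0.0001)
     scheduler = get_linear_schedule_with_warmup(
         optimizer,
         num_warmup_steps=0,
         num_training_steps=1000  # e.g. len(train_dataloader) * N_EPOCHS
     )
     return [optimizer], [{"scheduler": scheduler, "interval": "step"}]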
Using Trainer from PyTorch-Lightning to Finetune Model Using our Dataset

In the following cells, we use the Lightning Trainer to fine-tune the model on our dataset. The Trainer takes care of the training loop, checkpointing and GPU placement, which is especially handy for a model this large.

 model = BioQAModel()
 # To record the best performing model using checkpoint
 checkpoint_callback = ModelCheckpoint(
     dirpath="checkpoints",
     filename="best-checkpoint",
     save_top_k=1,
     verbose=True,
     monitor="val_loss",
     mode="min"
 )
 #logger = TensorBoardLogger("training-logs", name="bio-qa")
 trainer = pl.Trainer(
     #logger = logger,
     checkpoint_callback=checkpoint_callback,
     max_epochs=N_EPOCHS,
     gpus=1,
     progress_bar_refresh_rate = 30
 ) 
Loading Tensorboard

In the following cells, we load TensorBoard to watch how the model is progressing over steps and epochs. This is very helpful when running large models.

 %load_ext tensorboard
 %tensorboard --logdir ./lightning_logs
 #!rm -rf lightning_logs
 trainer.fit(model, data_module)
 trainer.test()  # evaluate the model according to the last checkpoint 
Predictions
 trained_model = BioQAModel.load_from_checkpoint("checkpoints/best-checkpoint.ckpt")
 trained_model.freeze()  # freeze the weights for inference
Generate Answers for the Questions in the Validation Set

In the following cells, we take a few questions from the validation set and compare the stored reference answers against the model's predictions. As you'll notice, the fine-tuned T5 model often reproduces the reference answers exactly.

 def generate_answer(question):
   source_encoding=tokenizer(
       question["question"],
       question['context'],
       max_length = 396,
       padding="max_length",
       truncation="only_second",
       return_attention_mask=True,
       add_special_tokens=True,
       return_tensors="pt"
   )
   generated_ids = trained_model.model.generate(
       input_ids=source_encoding["input_ids"],
       attention_mask=source_encoding["attention_mask"],
       num_beams=1,  # greedy search
       max_length=80,
       repetition_penalty=2.5,
       early_stopping=True,
       use_cache=True)
   preds = [
           tokenizer.decode(generated_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
           for generated_id in generated_ids
   ]
   return "".join(preds)
 sample_question = val_df.iloc[20]
 sample_question["question"]
 sample_question["answer_text"]  # Label Answer
 generate_answer(sample_question)  # Predicted answer
 sample_question = val_df.iloc[66]
 sample_question["answer_text"]
 generate_answer(sample_question) 
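You are not limited to rows from the validation set; any question/context pair in the same dict format works. A quick sketch with illustrative strings:

 # querying the fine-tuned model with a custom question/context pair;
 # the strings below are illustrative, any biomedical passage works
 custom_question = {
     "question": "Which protein carries oxygen in the blood?",
     "context": "Hemoglobin is the iron-containing protein in red blood cells "
                "that binds and transports oxygen through the bloodstream.",
 }
 print(generate_answer(custom_question))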
 # zip the /content directory so the checkpoints and logs can be downloaded
 !zip -r /content.zip /content
 from google.colab import files
 files.download("/content.zip") 

EndNote

We have successfully implemented a Question-Answering system using the T5 Transformer and PyTorch Lightning. I highly recommend fine-tuning this model on quiz data to build a trivia bot. Datasets for that specific task are provided here, taken from past competitions.

Mudit Rustagi

Mudit is experienced in machine learning and deep learning. He is an undergraduate in Mechatronics and has worked as a team lead (ML team) on several projects. He has a strong interest in doing SOTA ML projects and writing blogs on data science and machine learning.
