Most Benchmarked Datasets for Question Answering in NLP with implementation in PyTorch, Keras, and TensorFlow

Question answering is a field within natural language processing concerned with building systems that automatically answer questions posed by people in natural language. The ability to read a text and then answer questions about it is a difficult task for machines, requiring knowledge about the world. Existing datasets for question answering have two main weaknesses: some are too small for training modern data-intensive models, while those that are large do not share the characteristics of explicit reading comprehension questions.

To address the need for large, high-quality question answering datasets, we will discuss some of the popular datasets and show how to load them using PyTorch, TensorFlow, and Keras. We will also note the benchmark models that achieved the top reported scores on these datasets.


SQuAD

The Stanford Question Answering Dataset (SQuAD) comprises 100,000+ questions posed by crowd workers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. The dataset was introduced by researchers Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang from Stanford University.
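To make the span-based format concrete, here is a minimal sketch. It builds one hand-written paragraph in the layout used by the public SQuAD v1.1 JSON files (a `context` string plus `qas` entries whose answers carry `text` and a character offset `answer_start`) and checks that each answer is an exact slice of the passage. The sample text and the helper name `answer_span` are illustrative assumptions, not taken from the dataset.

```python
# A hand-written paragraph in the SQuAD v1.1 layout (illustrative data only).
paragraph = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "qas": [
        {
            "id": "example-1",
            "question": "What does the Amazon rainforest cover?",
            "answers": [{"text": "much of the Amazon basin", "answer_start": 29}],
        }
    ],
}


def answer_span(answer):
    """Return (start, end) character offsets of an answer inside its context."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return start, end


spans = []
for qa in paragraph["qas"]:
    for answer in qa["answers"]:
        start, end = answer_span(answer)
        # The answer must be an exact slice of the reading passage.
        assert paragraph["context"][start:end] == answer["text"]
        spans.append((qa["question"], start, end))

print(spans)
```

Because answers are stored as offsets into the passage, a model for this dataset only has to predict a start and an end position rather than generate free text.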

Loading the dataset using PyTorch

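import json
import os
# Assumption: the module path is missing in the source; PyTorch-NLP ships this
# helper as torchnlp.download.download_file_maybe_extract.
from torchnlp.download import download_file_maybe_extract

def squad_dataset(url_train, url_dev, directory='data/', train=False, dev=False,
                  train_filename='train-v1.1.json', dev_filename='dev-v1.1.json',
                  check_files_train=('train-v1.1.json',), check_files_dev=('dev-v1.1.json',)):
    # The source omits the download URLs and most of the signature, so the URLs
    # are required arguments here and the v1.1 file names are assumed defaults.
    download_file_maybe_extract(url=url_dev, directory=directory, check_files=check_files_dev)
    download_file_maybe_extract(url=url_train, directory=directory, check_files=check_files_train)
    squad = []
    splits_text = [(train, train_filename), (dev, dev_filename)]
    splits_text = [f for (requested, f) in splits_text if requested]
    for filename in splits_text:
        full_path = os.path.join(directory, filename)
        with open(full_path, 'r') as temp:
            # Each file keeps its list of articles under the top-level "data" key.
            squad.append(json.load(temp)['data'])
    if len(squad) == 1:
        return squad[0]
    return tuple(squad)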
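The SQuAD files nest questions inside paragraphs, and paragraphs inside articles, so a loader like the one above returns a list of articles rather than a list of questions. The following sketch flattens that structure into one record per question. The function name `flatten_squad` and the sample articles are hypothetical; only the key names (`title`, `paragraphs`, `context`, `qas`, `question`, `answers`, `text`) follow the public v1.1 layout.

```python
def flatten_squad(articles):
    """Turn SQuAD-style nested articles into a flat list of QA examples."""
    examples = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                examples.append({
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    # Keep every reference answer; a question may have several.
                    "answers": [a["text"] for a in qa["answers"]],
                })
    return examples


# Hand-written stand-in for the list stored under the "data" key (illustrative only).
sample_articles = [
    {
        "title": "Example_Article",
        "paragraphs": [
            {
                "context": "Paris is the capital of France. It lies on the Seine.",
                "qas": [
                    {"id": "q1", "question": "What is the capital of France?",
                     "answers": [{"text": "Paris", "answer_start": 0}]},
                    {"id": "q2", "question": "Which river does Paris lie on?",
                     "answers": [{"text": "the Seine", "answer_start": 43},
                                 {"text": "Seine", "answer_start": 47}]},
                ],
            }
        ],
    }
]

flat = flatten_squad(sample_articles)
for ex in flat:
    print(ex["question"], "->", ex["answers"])
```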

Loading the dataset using TensorFlow

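import tensorflow as tf

def squad(path, skip_pattern=r'\s*'):
  # Assumption: the dataset constructor is missing in the source; a line-by-line
  # text reader fits the filter/split/unbatch steps that follow.
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Assumption: the original regex is missing in the source; skip_pattern is a
    # placeholder that drops blank lines.
    return tf.logical_not(tf.strings.regex_full_match(source, skip_pattern))
  data = data.filter(content_filter)
  # Split each line into sentences on ' . ', then emit one sentence per element.
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# The path is left empty in the source; point it at a local text file.
train = squad('')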

State of the Art

The current state of the art on the SQuAD dataset is SA-Net on ALBERT, with an F1 score of 93.011.
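The F1 score reported for SQuAD is a token-overlap measure between a predicted answer and a reference answer. The sketch below follows the general recipe of the official evaluation script as I understand it (lowercase, strip punctuation and English articles, then compare bags of tokens); the function names `normalize_answer`, `token_f1`, and `best_f1` are my own, and the strings are made-up examples.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, references):
    """Score a prediction against several references and keep the best match."""
    return max(token_f1(prediction, ref) for ref in references)


partial = token_f1("the Amazon basin", "much of the Amazon basin")
print(round(partial, 3))
print(best_f1("the Seine", ["Seine", "the river Seine"]))
```

A dataset-level score is then the average of the per-question values, which is why partially correct spans still earn credit under F1.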



bAbI

bAbI Question Answering is a dataset for question answering and text understanding. The dataset consists of a set of contexts, with multiple question-answer pairs available for each context. It contains both English and Hindi content. The “ContentElements” field contains training data and testing data; these two give access to data formatted for standard training tasks, and they are retrieved from the 10k variant in English. bAbI was introduced by Facebook's AI research group.
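As a minimal, library-free sketch of how one context carries several question-answer pairs, the snippet below parses a few lines written in the style of the public bAbI task files: every line starts with an id, an id of 1 starts a new context, and question lines hold the question, the answer, and the supporting line ids separated by tabs. The sample lines and the function name `parse_babi` are illustrative assumptions.

```python
# Lines written in the style of a bAbI task file (illustrative sample).
SAMPLE = (
    "1 Mary moved to the bathroom.\n"
    "2 John went to the hallway.\n"
    "3 Where is Mary? \tbathroom\t1\n"
    "4 Daniel went back to the hallway.\n"
    "5 Sandra moved to the garden.\n"
    "6 Where is Daniel? \thallway\t4\n"
)


def parse_babi(lines):
    """Group statements into contexts and attach each question to its context."""
    examples, story = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        tid, text = line.split(" ", 1)
        if tid == "1":
            story = {}  # a line numbered 1 starts a new context
        if "\t" in text:
            question, answer, supporting = (part.strip() for part in text.split("\t"))
            examples.append({
                "story": list(story.values()),  # every statement seen so far
                "question": question,
                "answer": answer,
                "supporting": [story[int(i)] for i in supporting.split()],
            })
        else:
            story[int(tid)] = text
    return examples


examples = parse_babi(SAMPLE.splitlines())
for ex in examples:
    print(ex["question"], "->", ex["answer"], "| facts in context:", len(ex["story"]))
```

The second question sees all four statements made so far, which is what makes the later questions in a context harder than the earlier ones.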

Loading the dataset using PyTorch

import os
from io import open
import torch
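# Assumption: the module path is missing in the source. These classes live in
# torchtext.legacy.data in torchtext 0.9 to 0.11 (torchtext.data in older releases).
from torchtext.legacy.data import Dataset, Field, Example, Iterator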
class BABI20Field(Field):
    def __init__(self, memory_size, **kwargs):
        super(BABI20Field, self).__init__(**kwargs)
        self.memory_size = memory_size
        self.unk_token = None
        self.batch_first = True
    def preprocess(self, x):
        # A story is a list of sentences; a query or answer is a single string.
        if isinstance(x, list):
            return [super(BABI20Field, self).preprocess(s) for s in x]
        else:
            return super(BABI20Field, self).preprocess(x)
    def pad(self, minibatch):
        if isinstance(minibatch[0][0], list):
            # Pad every sentence to the longest sentence in the minibatch.
            self.fix_length = max(max(len(x) for x in ex) for ex in minibatch)
            padded = []
            for ex in minibatch:
                # sentences are indexed in reverse order and truncated to memory_size
                nex = ex[::-1][:self.memory_size]
                padded.append(
                    super(BABI20Field, self).pad(nex)
                    + [[self.pad_token] * self.fix_length]
                    * (self.memory_size - len(nex)))
            self.fix_length = None
            return padded
        else:
            return super(BABI20Field, self).pad(minibatch)
    def numericalize(self, arr, device=None):
        if isinstance(arr[0][0], list):
            # Numericalize each story separately, then stack into one tensor.
            tmp = [
                super(BABI20Field, self).numericalize(x, device=device).data
                for x in arr
            ]
            arr = torch.stack(tmp)
            if self.sequential:
                arr = arr.contiguous()
            return arr
        else:
            return super(BABI20Field, self).numericalize(arr, device=device)
class BABI20(Dataset):
    urls = ['']
    name = ''
    dirname = ''
    def __init__(self, path, text_field, only_supporting=False, **kwargs):
        fields = [('story', text_field), ('query', text_field), ('answer', text_field)]
        self.sort_key = lambda x: len(x.query)
        with open(path, 'r', encoding="utf-8") as f:
            triplets = self._parse(f, only_supporting)
            examples = [Example.fromlist(triplet, fields) for triplet in triplets]
        super(BABI20, self).__init__(examples, fields, **kwargs)
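    @staticmethod
    def _parse(file, only_supporting):
        datanew, parse_story = [], []
        for line in file:
            tid, text = line.rstrip('\n').split(' ', 1)
            if tid == '1':
                parse_story = []    # a line numbered 1 starts a new story
            if text.endswith('.'):
                parse_story.append(text[:-1])    # statement: store it without the '.'
            else:
                # question line: query, answer, and supporting ids are tab-separated
                query, answer, supporting = (x.strip() for x in text.split('\t'))
                if only_supporting:
                    substory = [parse_story[int(i) - 1] for i in supporting.split()]
                else:
                    substory = [x for x in parse_story if x]
                datanew.append((substory, query[:-1], answer))    # remove '?'
                parse_story.append("")    # placeholder keeps line ids aligned with indices
        return datanew
    @classmethod
    def iters(cls, batch_size=32, root='.data', memory_size=50, task=1, joint=False,
              tenK=False, only_supporting=False, sort=False, shuffle=False, device=None,
              **kwargs):
        textnew = BABI20Field(memory_size)
        # BABI20.splits is not shown in this excerpt; in torchtext it downloads the
        # archive and returns the train, validation, and test datasets.
        train, val, test = BABI20.splits(textnew, root=root, task=task, joint=joint,
                                         tenK=tenK, only_supporting=only_supporting,
                                         **kwargs)
        # Build the vocabulary before batching so tokens can be numericalized.
        textnew.build_vocab(train)
        return Iterator.splits((train, val, test), batch_size=batch_size, sort=sort,
                               shuffle=shuffle, device=device)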
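The `pad` method of `BABI20Field` is the least obvious step in the listing: it turns a story of variable length into a fixed-size "memory" of sentences. A plain-Python sketch of the same idea is shown here, without torchtext. The function name `pad_story` and the `<pad>` token are assumptions for illustration, and unlike the listing, this version sizes sentences per story rather than per minibatch.

```python
def pad_story(story, memory_size, pad_token="<pad>"):
    """Return exactly memory_size sentences, most recent first, all the same length.

    story is a list of tokenized sentences in the order they were read.
    """
    sent_len = max(len(sentence) for sentence in story)
    # Most recent sentences come first; older ones beyond memory_size are dropped.
    recent_first = story[::-1][:memory_size]
    padded = [s + [pad_token] * (sent_len - len(s)) for s in recent_first]
    # Fill any unused memory slots with all-padding sentences.
    padded += [[pad_token] * sent_len for _ in range(memory_size - len(padded))]
    return padded


story = [
    ["Mary", "moved", "to", "the", "bathroom"],
    ["John", "went", "to", "the", "hallway"],
    ["Daniel", "left"],
]

for row in pad_story(story, memory_size=2):
    print(row)
for row in pad_story(story, memory_size=4):
    print(row)
```

Keeping the most recent sentences first means that when a story is longer than the memory, the facts closest to the question are the ones that survive truncation.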

Loading the dataset using Keras

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
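try:
    # The download URL is left empty in the source.
    path_new = get_file('babi-tasks-v1-2.tar.gz', origin='')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
# Assumption: this line is truncated in the source; open the archive with tarfile.
readfile = tarfile.open(path_new)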

State of the Art

The current state of the art on the bAbI dataset is STM, with an accuracy of 99.85%.

Natural Questions

Natural Questions contains 307,373 questions for training, 7,830 questions for development, and 7,842 questions for testing, along with human-annotated answers drawn from Wikipedia pages, for use in training question answering systems. It is the first dataset to replicate the end-to-end process in which people find answers to questions. The dataset was released by Google Research; Lin Pan, Rishav Chakravarti, Anthony Ferritto and Michael Glass are among the researchers who have built question answering systems on it.

Loading the dataset using TensorFlow

import bert
from bert import tokenization
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import hashlib
import glob
import os
from tensorflow.python.ops import math_ops
from collections import Counter
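# tf.metrics.accuracy is a TensorFlow 1.x API and is not used below.
# from tensorflow.metrics import accuracy
# %matplotlib inline  (Jupyter notebook magic; not valid in a plain Python script)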
import jsonlines
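_train_file_path = '/Users/deniz/natural_questions/data/nq-train-*.jsonl'
train_files = glob.glob(_train_file_path)
examples = []
for _train_file in train_files:
    # Assumption: the reader call is missing in the source; jsonlines.open
    # yields one JSON object per line of the file.
    with jsonlines.open(_train_file) as reader:
        for i, example in enumerate(reader):
            # pop unused keys
            del example['document_html']
            # Assumption: collect each record so it can be used after the loop.
            examples.append(example)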
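Once a record is loaded, the annotated answers are expressed as token ranges over the page rather than as strings. The sketch below recovers the answer text from a tiny hand-written record. The field names (`question_text`, `document_tokens`, `html_token`, `annotations`, `long_answer`, `short_answers`, `start_token`, `end_token`) follow my understanding of the published Natural Questions format; treating `end_token` as exclusive and `-1` as "no answer" are assumptions to verify against the dataset documentation, and `span_text` is a hypothetical helper.

```python
def span_text(example, span, drop_html=True):
    """Join the document tokens covered by an annotation span into a string."""
    start, end = span["start_token"], span["end_token"]
    if start < 0:
        return None  # assumption: -1 marks a missing answer
    tokens = example["document_tokens"][start:end]  # assumption: end is exclusive
    words = [t["token"] for t in tokens if not (drop_html and t["html_token"])]
    return " ".join(words)


# Hand-written record in the style of one Natural Questions line (illustrative only).
raw_tokens = "<P> On the Origin of Species was written by Charles Darwin . </P>".split()
example = {
    "question_text": "who wrote on the origin of species",
    "document_tokens": [
        {"token": tok, "html_token": tok.startswith("<") and tok.endswith(">")}
        for tok in raw_tokens
    ],
    "annotations": [
        {
            "long_answer": {"start_token": 0, "end_token": 13},
            "short_answers": [{"start_token": 9, "end_token": 11}],
        }
    ],
}

annotation = example["annotations"][0]
long_answer = span_text(example, annotation["long_answer"])
short_answers = [span_text(example, s) for s in annotation["short_answers"]]
print(long_answer)
print(short_answers)
```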

State of the Art

The current state of the art on the Natural Questions dataset is GPT-3 175B, with an accuracy of 29.9%.


Conclusion

In this article, we covered some of the high-quality datasets used in question answering and loaded these corpora using different Python libraries. The datasets feature a diverse range of question and answer types. From the results above, we can see that the STM model performed exceptionally well on the bAbI dataset, with an accuracy above 99%.

Ankit Das
A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.
