Most Popular Datasets for Question Classification

Question classification plays a significant part in question answering systems, and identifying the question type is one of the most important steps in improving the classification process. The main aim of question classification is to predict the entity type of the answer to a natural language question. Question classification is usually done with machine learning techniques.

For example, Naive Bayes, k-Nearest Neighbors, and SVM algorithms can be used to implement question classification. Bag-of-words and n-grams are common features in the machine learning approach.
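
To make the feature side of this concrete, here is a minimal sketch of bag-of-words and n-gram extraction in plain Python. The function names (tokenize, ngrams, extract_features) and the whitespace tokenizer are assumptions made for illustration, not part of any library; a real pipeline would feed such counts into a classifier like Naive Bayes or an SVM.

```python
from collections import Counter

def tokenize(question):
    """Lowercase the question, drop question marks, and split on whitespace."""
    return question.lower().replace("?", " ").split()

def ngrams(tokens, n):
    """Return every contiguous n-gram in the token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(question, max_n=2):
    """Count all n-grams up to max_n; the n=1 counts are the bag-of-words."""
    tokens = tokenize(question)
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features

# Illustrative question, not taken from any of the datasets below.
features = extract_features("Who wrote Hamlet ?")
print(features)
```

The unigram ("who",) is often a strong hint for a person-type answer, which is why such simple counts already work reasonably well for this task.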

Recent developments in deep learning have proved effective for question classification. CNN-based models can extract high-level features from local text using window filters. Various lexical, syntactic, and semantic features can be extracted from a question.
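
The window-filter idea can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the embeddings and filter weights are made up, the function name conv1d_max_pool is hypothetical, and a real CNN would learn many filters with a deep learning framework.

```python
def conv1d_max_pool(embeddings, filt):
    """Slide one filter over windows of consecutive word vectors,
    apply ReLU to each window score, and max-pool over all positions."""
    window = len(filt)
    scores = []
    for start in range(len(embeddings) - window + 1):
        total = 0.0
        for offset in range(window):
            for x, w in zip(embeddings[start + offset], filt[offset]):
                total += x * w
        scores.append(max(0.0, total))  # ReLU
    # A question shorter than the window yields no scores.
    return max(scores) if scores else 0.0

# Toy 2-dimensional embeddings for a 4-word question (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning a window of 2 words (made-up weights).
filt = [[1.0, 0.0], [0.0, 1.0]]

feature = conv1d_max_pool(embeddings, filt)
print(feature)
```

Each filter produces one feature per question regardless of its length, which is what lets the classifier layer on top have a fixed input size.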

Here, we will discuss some of the popular datasets and how to load them using TensorFlow and PyTorch. We will also mention benchmark models that achieved high accuracy on these datasets.

TREC

The TREC dataset is used for question classification and consists of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was started in 1992.
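
The relationship between the six-class and fifty-class labels can be seen in how a line of the label files is parsed. The sketch below assumes each line has the form "COARSE:fine question text", which is the format the loaders in this article rely on; the function name parse_trec_line is hypothetical.

```python
def parse_trec_line(line):
    """Split 'COARSE:fine question text' into (coarse, fine, text)."""
    label, _, text = line.strip().partition(" ")
    coarse, _, fine = label.partition(":")
    return coarse, fine, text

# An illustrative line in the assumed format.
coarse, fine, text = parse_trec_line(
    "DESC:manner How did serfdom develop in and then leave Russia ?"
)
print(coarse, fine, text)
```

Using only the coarse part gives the TREC-6 task, while keeping the fine part gives the TREC-50 task.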

Loading the dataset using PyTorch

import os
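# Note: this uses the legacy torchtext API (data.Dataset, data.Field),
# which is only importable this way in older torchtext releases.
from torchtext import data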
class TREC(data.Dataset):
    urls = ['http://cogcomp.org/Data/QA/QC/train_5500.label',
            'http://cogcomp.org/Data/QA/QC/TREC_10.label']
    class_name = 'trec'
    directoryname = ''
    @staticmethod
    def sort_key(ex):
        return len(ex.text)
    def __init__(self, path, text_field, label_field,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        exam1 = []
        def get_label_str(label):
            return label.split(':')[0] if not fine_grained else label
        label_field.preprocessing = data.Pipeline(get_label_str)
        # Read raw bytes so the stray 0xF0 byte can be replaced before decoding.
        with open(os.path.expanduser(path), 'rb') as f:
            for line in f:
                label, _, text = line.replace(b'\xf0', b' ').decode().partition(' ')
                exam1.append(data.Example.fromlist([text, label], fields))
        super(TREC, self).__init__(exam1, fields, **kwargs)
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train_5500.label', test='TREC_10.label', **kwargs):
        return super(TREC, cls).splits(
            root=root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)
    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text1 = data.Field()
        label1 = data.Field(sequential=False)
        train, test = cls.splits(text1, label1, root=root, **kwargs)
        text1.build_vocab(train, vectors=vectors)
        label1.build_vocab(train)
        return data.BucketIterator.splits(
            (train, test), batch_size=batch_size, device=device)

Loading the dataset using TensorFlow

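import tensorflow as tf
import tensorflow_datasets.public_api as tfds
# Placeholder metadata strings (assumptions); replace with the real text.
_DESCRIPTION = "TREC question classification dataset."
_CITATION = ""
_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
# Only some of the fine-grained label names are listed here.
fine_label = [ "manner","cremat","animal","exp","ind","title","date","reason","event","state","desc","count"]
# Only _info is shown; a complete builder also needs _split_generators
# and _generate_examples.
class Trec(tfds.core.GeneratorBasedBuilder):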
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )

State of the Art

At the time of writing, the state of the art on the TREC-6 dataset is USE_T+CNN, which reports an error rate of 1.93%.
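
For reference, an error figure like this is usually the percentage of test questions that were misclassified, that is, 100 minus the accuracy. The sketch below assumes that definition; the function name error_rate and the toy labels are made up for illustration.

```python
def error_rate(predictions, gold_labels):
    """Return the percentage of predictions that differ from the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute an error rate on an empty test set")
    wrong = sum(p != g for p, g in zip(predictions, gold_labels))
    return 100.0 * wrong / len(gold_labels)

# Toy example with the TREC-6 coarse labels (made-up predictions).
gold = ["DESC", "HUM", "LOC", "NUM"]
pred = ["DESC", "HUM", "LOC", "ENTY"]
rate = error_rate(pred, gold)
print(rate)
```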

UMICH SI650

The UMICH SI650 dataset contains real questions posted by the community on Yahoo Answers. The training set contains 2,698 questions, and the test set has 1,874 unlabeled questions. Each question belongs to one of the following categories:

1. Business

2. Computers

3. Entertainment and Music

4. Family

5. Education

6. Health

7. Science

Loading the dataset using TensorFlow

import tensorflow as tf
def UMICH(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
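# TextLineDataset reads files, not web pages: download the data from
# https://www.kaggle.com/c/si650winter11/data and pass the local file path.
train = UMICH('path/to/training.txt')  # hypothetical local path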

ARC

The ARC dataset comprises 7,787 multiple-choice science questions, labeled to enable targeted pairing of questions with specific problem solvers. It contains 400 detailed question classes and problem domains for these science exam questions, developed from exam syllabi, study guides, and detailed data analysis of the ARC questions. It was developed by the researchers Dongfang Xu, Peter Jansen, and Jaycie Martin.

Loading the dataset using TensorFlow

import tensorflow as tf
def ARC(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
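# TextLineDataset reads files, not web pages: download ARC-Challenge-Dev.csv from
# https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge and pass the local path.
train = ARC('ARC-Challenge-Dev.csv')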
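
Because the ARC files are CSV, a question that contains a comma or a period is easy to break with line-based splitting. As an alternative sketch, the standard csv module handles quoted fields correctly. The column names "question" and "category" and the sample row below are assumptions made for illustration; check the header of the downloaded file before relying on them.

```python
import csv
import io

def load_questions(csv_file, text_column="question", label_column="category"):
    """Read (question, label) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row[text_column], row[label_column]) for row in reader]

# In-memory stand-in for a downloaded file (made-up row).
sample = io.StringIO(
    "question,category\n"
    '"Which gas do plants absorb, according to the lesson?",LIFE\n'
)
rows = load_questions(sample)
print(rows)
```

With a real file, the same function could be called as load_questions(open(path, newline="", encoding="utf-8")), where path points at the downloaded CSV.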

Conclusion

In this article, we have discussed some of the most popular datasets used in question classification and shown how to load these text corpora using PyTorch and TensorFlow. These datasets feature a diverse range of questions. Since question classification is a critical component of question answering, various deep learning models can be built on these datasets to achieve high accuracy.

Copyright Analytics India Magazine Pvt Ltd
