Most Popular Datasets for Question Classification

Question classification plays a significant role in question answering systems, and identifying the question type is one of the most important steps in improving the classification process. The main aim of question classification is to predict the entity type of the answer to a natural language question. Question classification is commonly performed using machine learning techniques.

For example, algorithms such as Naive Bayes, k-Nearest Neighbors, and SVM can be used to implement question classification. In many cases, bag-of-words and n-gram features serve as the inputs to these machine learning approaches, as sketched below.
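
A minimal sketch of this classical approach, assuming scikit-learn is installed and that the questions and labels are already available as Python lists (the two examples below are toy placeholders, not part of any dataset):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy placeholder data; in practice these would come from a dataset such as TREC
questions = ["What is the capital of France ?", "Who wrote Hamlet ?"]
labels = ["LOC", "HUM"]

# Bag-of-words (unigram + bigram) features feeding a linear SVM
clf = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
clf.fit(questions, labels)
print(clf.predict(["Where is the Eiffel Tower ?"]))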

Recent developments in deep learning have also demonstrated strong results in question classification. CNN-based models can extract high-level features from local text using window filters, and distinct lexical, syntactic, and semantic features can be extracted from a question.
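
As a rough illustration of such a window-filter CNN in TensorFlow/Keras (the vocabulary size, sequence length, filter width, and number of classes below are placeholder assumptions, not values from any benchmark model):

import tensorflow as tf

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 10000, 30, 6  # assumed hyperparameters

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Conv1D(100, 3, activation="relu"),   # window filters of width 3
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()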

Here, we will discuss some of the popular datasets for question classification and show how to load them using TensorFlow and PyTorch. Further, we will mention some of the benchmark models that achieve high accuracy on these datasets.

TREC

The TREC dataset is used for question classification and consists of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was introduced in 1992.

Loading the dataset using PyTorch

import os
from torchtext import data  # requires a torchtext version that still ships the legacy data API

class TREC(data.Dataset):
    # Download locations for the training and test label files
    urls = ['http://cogcomp.org/Data/QA/QC/train_5500.label',
            'http://cogcomp.org/Data/QA/QC/TREC_10.label']
    name = 'trec'
    dirname = ''

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    def __init__(self, path, text_field, label_field,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        examples = []

        def get_label_str(label):
            # Keep only the coarse label (before the colon) unless fine_grained is requested
            return label.split(':')[0] if not fine_grained else label
        label_field.preprocessing = data.Pipeline(get_label_str)

        for line in open(os.path.expanduser(path), 'rb'):
            # Each line is "<LABEL> <question text>"; strip a known bad byte before decoding
            label, _, text = line.replace(b'\xf0', b' ').decode().partition(' ')
            examples.append(data.Example.fromlist([text, label], fields))
        super(TREC, self).__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train_5500.label', test='TREC_10.label', **kwargs):
        return super(TREC, cls).splits(
            root=root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text1 = data.Field()
        label1 = data.Field(sequential=False)
        train, test = cls.splits(text1, label1, root=root, **kwargs)
        text1.build_vocab(train, vectors=vectors)
        label1.build_vocab(train)
        return data.BucketIterator.splits(
            (train, test), batch_size=batch_size, device=device)
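
A quick usage sketch of this loader (device handling differs across torchtext versions, so the device argument below is an assumption):

# Build bucketed iterators over the TREC train/test splits
train_iter, test_iter = TREC.iters(batch_size=32, device='cpu')
batch = next(iter(train_iter))
print(batch.text.shape, batch.label.shape)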

Loading the dataset using TensorFlow

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

# Placeholder strings; a full builder would carry the real description and BibTeX entry
_DESCRIPTION = "TREC question classification dataset."
_CITATION = ""

_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
# Subset of the 50 fine-grained TREC labels, shown here for brevity
fine_label = ["manner", "cremat", "animal", "exp", "ind", "title", "date",
              "reason", "event", "state", "desc", "count"]

class Trec(tfds.core.GeneratorBasedBuilder):
  """Partial TFDS builder sketch; _split_generators and _generate_examples
  would still need to be implemented to download and parse the label files."""
  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )
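
Since the TensorFlow Datasets catalog also ships a ready-made trec dataset, a simpler route (assuming the tensorflow-datasets package is installed and network access is available) is to load it directly:

import tensorflow_datasets as tfds

# Downloads and prepares TREC on the first call
ds_train, ds_test = tfds.load("trec", split=["train", "test"])
for example in ds_train.take(1):
    print(example["text"], example["label-coarse"], example["label-fine"])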

State of the Art

The current state of the art on the TREC-6 dataset is USE_T+CNN, which reports an error rate of 1.93.

UMICH SI650

UMICH SI650 is a dataset containing real questions posted by users on Yahoo! Answers. The training data contains 2,698 questions, and the test set has 1,874 unlabeled questions. Each question belongs to one of the following categories:

1. Business

2. Computers

3. Entertainment

4. Music

5. Family

6. Education

7. Health

8. Science

Loading the dataset using TensorFlow

import tensorflow as tf

def UMICH(path):
  # Read the raw text file line by line
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop header-style lines of the form "= ... ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  # Split each line into sentence-like chunks
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# tf.data.TextLineDataset needs a local (or GCS) file path, so the data file
# must first be downloaded from https://www.kaggle.com/c/si650winter11/data
# (Kaggle login required); 'training.txt' below is an assumed local filename.
train = UMICH('training.txt')
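
Once the file is available locally, a quick sanity check on the resulting tf.data pipeline might look like:

# Print the first few elements produced by the pipeline
for line in train.take(3):
    print(line.numpy())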

ARC

The ARC dataset comprises 7,787 multiple-choice science questions, annotated to enable the targeted pairing of questions with specific problem solvers. It contains 400 detailed question classes and problem domains for these science exam questions, developed based on exam syllabi, study guides, and a detailed data analysis of the ARC questions. It was developed by the researchers Dongfang Xu, Peter Jansen, and Jaycie Martin.

Loading the dataset using TensorFlow

import tensorflow as tf

def ARC(path):
  # Read the CSV file line by line as raw text
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop header-style lines of the form "= ... ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  # Split each line into sentence-like chunks
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# The CSV must first be downloaded locally from
# https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge
# since tf.data.TextLineDataset cannot fetch Kaggle URLs directly.
train = ARC('ARC-Challenge-Dev.csv')
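
Because the file is a CSV, a hedged alternative (assuming pandas is available and the downloaded file has a header row) is to load it into a DataFrame first and inspect the columns before building a tf.data pipeline:

import pandas as pd
import tensorflow as tf

df = pd.read_csv('ARC-Challenge-Dev.csv')  # assumed local copy of the Kaggle file
print(df.columns)                          # inspect the available fields first
# The column holding the question text can then be wrapped as a dataset, e.g.
# tf.data.Dataset.from_tensor_slices(df['question'].astype(str).values)
# where 'question' is an assumed column name.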

Conclusion

In this article, we discussed some of the most popular datasets used in question classification and implemented loaders for these corpora using PyTorch and TensorFlow. The datasets feature a diverse range of questions. As question classification is a critical step in question answering, various deep learning models can be further applied to these datasets to achieve high accuracy.
