
Most Popular Datasets for Question Classification

Question classification plays a significant role in question answering systems; one of the most important steps in improving the answering process is identifying the question type. The main aim of question classification is to predict the entity type of the answer to a natural language question. Question classification is usually performed using machine learning techniques.

For example, algorithms such as Naive Bayes, k-Nearest Neighbors, and SVM can be used to implement question classification. In such approaches, bag-of-words and n-gram features are commonly used.
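As a minimal sketch of this classical approach, the snippet below trains a bag-of-words Naive Bayes classifier with scikit-learn. The handful of example questions and their coarse labels are made up for illustration; a real setup would train on one of the datasets discussed below.

```python
# A minimal sketch of the bag-of-words + Naive Bayes approach,
# using scikit-learn and a few made-up example questions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

questions = [
    "How many states are there in the USA ?",
    "How far is the moon from the earth ?",
    "Where is the Eiffel Tower located ?",
    "What city hosted the 1988 Olympics ?",
    "Who wrote Hamlet ?",
    "Who discovered penicillin ?",
]
labels = ["NUM", "NUM", "LOC", "LOC", "HUM", "HUM"]

# Bag-of-words (unigram + bigram) features feeding a multinomial Naive Bayes
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(questions, labels)

print(model.predict(["Who painted the Mona Lisa ?"]))
```

The same pipeline pattern works with k-Nearest Neighbors or an SVM by swapping the final estimator.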

Recent developments in deep learning have demonstrated its ability in question classification. CNN architectures can extract high-level features from local text through window filters. Distinct lexical, syntactic, and semantic features can be extracted from a question.
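The window-filter idea can be sketched in a few lines of NumPy. Everything here is illustrative: the sequence length, embedding size, random embeddings, and random filter weights all stand in for what a trained model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes are illustrative): a question of 6 tokens,
# each represented by an 8-dimensional word embedding.
seq_len, embed_dim, num_filters, window = 6, 8, 4, 3
embeddings = rng.normal(size=(seq_len, embed_dim))

# Each convolutional filter spans `window` consecutive word vectors.
filters = rng.normal(size=(num_filters, window * embed_dim))

def conv_max_pool(x, w):
    # Stack each window of 3 consecutive word vectors into one flat vector
    windows = np.stack([x[i:i + window].ravel()
                        for i in range(len(x) - window + 1)])
    feature_maps = np.maximum(windows @ w.T, 0.0)   # convolution + ReLU
    return feature_maps.max(axis=0)                  # max-over-time pooling

features = conv_max_pool(embeddings, filters)
print(features.shape)   # one pooled feature per filter
```

The pooled feature vector would then feed a fully connected softmax layer that predicts the question class.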

Here, we will discuss some of the popular datasets and their code implementation using TensorFlow and PyTorch. Further, we will look at some of the benchmark models that achieved high accuracy on these datasets.

TREC

The TREC dataset is used for question classification and consists of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version; both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was introduced in 1992.
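Each line in the raw TREC files starts with a "COARSE:fine" label followed by the question text, which is how the six-class and fifty-class versions share the same files. A minimal parser for that format:

```python
# Each raw TREC line looks like "NUM:date When did Hawaii become a state ?",
# i.e. a coarse label, a colon, a fine label, then the question text.
def parse_trec_line(line, fine_grained=False):
    label, _, text = line.partition(" ")
    if not fine_grained:
        label = label.split(":")[0]   # keep only the coarse class (TREC-6)
    return label, text

line = "NUM:date When did Hawaii become a state ?"
print(parse_trec_line(line))                      # coarse (TREC-6) label
print(parse_trec_line(line, fine_grained=True))   # fine-grained (TREC-50) label
```

The PyTorch loader below applies the same coarse/fine split via a preprocessing pipeline on the label field.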


Loading the dataset using PyTorch

import os
from torchtext import data  # in torchtext >= 0.9, this API lives in torchtext.legacy
class TREC(data.Dataset):
    urls = ['http://cogcomp.org/Data/QA/QC/train_5500.label',
            'http://cogcomp.org/Data/QA/QC/TREC_10.label']
    name = 'trec'    # attribute names expected by torchtext's download helpers
    dirname = ''
    @staticmethod
    def sort_key(ex):
        return len(ex.text)
    def __init__(self, path, text_field, label_field,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        exam1 = []
        def get_label_str(label):
            return label.split(':')[0] if not fine_grained else label
        label_field.preprocessing = data.Pipeline(get_label_str)
        for line in open(os.path.expanduser(path), 'rb'):
            label, _, text = line.replace(b'\xf0', b' ').decode().partition(' ')
            exam1.append(data.Example.fromlist([text, label], fields))
        super(TREC, self).__init__(exam1, fields, **kwargs)
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train_5500.label', test='TREC_10.label', **kwargs):
        return super(TREC, cls).splits(
            root=root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)
    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text1 = data.Field()
        label1 = data.Field(sequential=False)
        train, test = cls.splits(text1, label1, root=root, **kwargs)
        text1.build_vocab(train, vectors=vectors)
        label1.build_vocab(train)
        return data.BucketIterator.splits(
            (train, test), batch_size=batch_size, device=device)

Loading the dataset using TensorFlow

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
_DESCRIPTION = "TREC question classification dataset (coarse and fine labels)."
_CITATION = ""  # fill in the BibTeX entry from https://cogcomp.seas.upenn.edu/Data/QA/QC/
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
# Only a subset of the fine-grained labels is listed here
fine_label = ["manner", "cremat", "animal", "exp", "ind", "title",
              "date", "reason", "event", "state", "desc", "count"]
class Trec(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )
  # A complete builder would also implement _split_generators and
  # _generate_examples to download _URLs and yield parsed examples.

State of the Art

The current state of the art on the TREC-6 dataset is USE_T+CNN, which achieved an error rate of 1.93%.


UMICH S1650

UMICH SI650 is a data file containing genuine questions posted by users on Yahoo Answers. The training data contains 2,698 questions, and the test set has 1,874 questions that are unlabeled. Each question belongs to one of the following categories:

1. Business

2. Computers

3. Entertainment

4. Music

5. Family

6. Education

7. Health

8. Science
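Before training a classifier on these eight categories, the string labels are typically mapped to integer ids. A minimal sketch of that encoding (mirroring what label encoders in most ML libraries do):

```python
# Map the eight UMICH category names to integer ids and back.
categories = ["Business", "Computers", "Entertainment", "Music",
              "Family", "Education", "Health", "Science"]
label_to_id = {name: i for i, name in enumerate(categories)}
id_to_label = {i: name for name, i in label_to_id.items()}

print(label_to_id["Music"])   # 3
print(id_to_label[7])         # Science
```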

Loading the dataset using TensorFlow

import tensorflow as tf

def UMICH(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop section-header lines of the form "= Heading ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# TextLineDataset reads local files, so first download the training data from
# https://www.kaggle.com/c/si650winter11/data
train = UMICH('training.txt')

ARC

The ARC dataset comprises 7,787 multiple-choice science questions, intended to enable the targeted pairing of questions with specific problem solvers. It contains 400 detailed question classes and problem domains for these science exam questions, developed on the basis of exam syllabi, study guides, and a detailed data analysis of the ARC questions. The question classification scheme was developed by the researchers Dongfang Xu, Peter Jansen, and Jaycie Martin.
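Since the ARC files on Kaggle are distributed as CSV, they can also be read directly with Python's standard csv module. This is a hedged sketch: the column names and the sample row below are illustrative, not the dataset's exact schema.

```python
import csv
import io

# Illustrative ARC-style CSV; real column names may differ from these.
sample = io.StringIO(
    "questionID,question,AnswerKey\n"
    "q1,Which gas do plants absorb from the atmosphere?,A\n"
)

# DictReader yields one dict per question row, keyed by the header line.
questions = list(csv.DictReader(sample))
print(questions[0]["question"])
```

For a real run, replace the in-memory sample with `open('ARC-Challenge-Dev.csv')` after downloading the file.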

Loading the dataset using TensorFlow

import tensorflow as tf

def ARC(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop section-header lines of the form "= Heading ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# TextLineDataset reads local files; first download the CSV from
# https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge
train = ARC('ARC-Challenge-Dev.csv')

Conclusion

In this article, we have discussed some of the most popular datasets used in question classification. Further, we loaded these text corpora using PyTorch and TensorFlow. These datasets feature a diverse range of questions. As question classification is a critical component of question answering, various deep learning models can be implemented on them to achieve high accuracy.


Ankit Das

A data analyst with expertise in statistical analysis and data visualization, ready to serve the industry using various analytical platforms. I look forward to gaining in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.