
Most Popular Datasets for Question Classification

Question classification plays a significant role in question answering systems; one of the most important steps in improving the answering process is identifying the question type. The main aim of question classification is to predict the entity type of the answer to a natural language question. Question classification is usually performed using machine learning techniques.

For example, algorithms such as Naive Bayes, k-Nearest Neighbors, and SVM can be used to implement question classification. In such approaches, bag-of-words and n-gram features are commonly used.
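As a minimal sketch of this classical approach, the snippet below trains a bag-of-words Naive Bayes classifier with scikit-learn. The handful of example questions and their coarse labels are made up for illustration; a real setup would train on one of the datasets discussed below.

```python
# A minimal sketch of the bag-of-words + Naive Bayes approach,
# using scikit-learn and a few made-up example questions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

questions = [
    "How many states are there in the USA ?",
    "How far is the moon from the earth ?",
    "Where is the Eiffel Tower located ?",
    "What city hosted the 1988 Olympics ?",
    "Who wrote Hamlet ?",
    "Who discovered penicillin ?",
]
labels = ["NUM", "NUM", "LOC", "LOC", "HUM", "HUM"]

# Bag-of-words (unigram + bigram) features feeding a multinomial Naive Bayes
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(questions, labels)

print(model.predict(["Who painted the Mona Lisa ?"]))
```

The same pipeline pattern works with k-Nearest Neighbors or an SVM by swapping the final estimator.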

Recent developments in deep learning have demonstrated its ability in question classification. CNN architectures can extract high-level features from local text through window filters. Distinct lexical, syntactic, and semantic features can be extracted from a question.
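The window-filter idea can be sketched in a few lines of NumPy. Everything here is illustrative: the sequence length, embedding size, random embeddings, and random filter weights all stand in for what a trained model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes are illustrative): a question of 6 tokens,
# each represented by an 8-dimensional word embedding.
seq_len, embed_dim, num_filters, window = 6, 8, 4, 3
embeddings = rng.normal(size=(seq_len, embed_dim))

# Each convolutional filter spans `window` consecutive word vectors.
filters = rng.normal(size=(num_filters, window * embed_dim))

def conv_max_pool(x, w):
    # Stack each window of 3 consecutive word vectors into one flat vector
    windows = np.stack([x[i:i + window].ravel()
                        for i in range(len(x) - window + 1)])
    feature_maps = np.maximum(windows @ w.T, 0.0)   # convolution + ReLU
    return feature_maps.max(axis=0)                  # max-over-time pooling

features = conv_max_pool(embeddings, filters)
print(features.shape)   # one pooled feature per filter
```

The pooled feature vector would then feed a fully connected softmax layer that predicts the question class.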

Here, we will discuss some of the popular datasets and their code implementation using TensorFlow and PyTorch. Further, we will look at some of the benchmark models that achieved high accuracy on these datasets.

TREC

The TREC dataset is used for question classification and consists of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version; both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was introduced in 1992.
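Each line in the raw TREC files starts with a "COARSE:fine" label followed by the question text, which is how the six-class and fifty-class versions share the same files. A minimal parser for that format:

```python
# Each raw TREC line looks like "NUM:date When did Hawaii become a state ?",
# i.e. a coarse label, a colon, a fine label, then the question text.
def parse_trec_line(line, fine_grained=False):
    label, _, text = line.partition(" ")
    if not fine_grained:
        label = label.split(":")[0]   # keep only the coarse class (TREC-6)
    return label, text

line = "NUM:date When did Hawaii become a state ?"
print(parse_trec_line(line))                      # coarse (TREC-6) label
print(parse_trec_line(line, fine_grained=True))   # fine-grained (TREC-50) label
```

The PyTorch loader below applies the same coarse/fine split via a preprocessing pipeline on the label field.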


Loading the dataset using PyTorch

import os
from torchtext import data  # in torchtext >= 0.9, this API lives in torchtext.legacy
class TREC(data.Dataset):
    urls = ['http://cogcomp.org/Data/QA/QC/train_5500.label',
            'http://cogcomp.org/Data/QA/QC/TREC_10.label']
    name = 'trec'    # attribute names expected by torchtext's download helpers
    dirname = ''
    @staticmethod
    def sort_key(ex):
        return len(ex.text)
    def __init__(self, path, text_field, label_field,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        exam1 = []
        def get_label_str(label):
            return label.split(':')[0] if not fine_grained else label
        label_field.preprocessing = data.Pipeline(get_label_str)
        for line in open(os.path.expanduser(path), 'rb'):
            label, _, text = line.replace(b'\xf0', b' ').decode().partition(' ')
            exam1.append(data.Example.fromlist([text, label], fields))
        super(TREC, self).__init__(exam1, fields, **kwargs)
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train_5500.label', test='TREC_10.label', **kwargs):
        return super(TREC, cls).splits(
            root=root, text_field=text_field, label_field=label_field,
            train=train, validation=None, test=test, **kwargs)
    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text1 = data.Field()
        label1 = data.Field(sequential=False)
        train, test = cls.splits(text1, label1, root=root, **kwargs)
        text1.build_vocab(train, vectors=vectors)
        label1.build_vocab(train)
        return data.BucketIterator.splits(
            (train, test), batch_size=batch_size, device=device)

Loading the dataset using TensorFlow

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
_DESCRIPTION = "TREC question classification dataset (coarse and fine labels)."
_CITATION = ""  # fill in the BibTeX entry from https://cogcomp.seas.upenn.edu/Data/QA/QC/
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
# Only a subset of the fine-grained labels is listed here
fine_label = ["manner", "cremat", "animal", "exp", "ind", "title",
              "date", "reason", "event", "state", "desc", "count"]
class Trec(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )
  # A complete builder would also implement _split_generators and
  # _generate_examples to download _URLs and yield parsed examples.

State of the Art

The current state of the art on the TREC-6 dataset is USE_T+CNN, which achieved an error rate of 1.93%.


UMICH S1650

UMICH SI650 is a data file containing genuine questions posted by users on Yahoo Answers. The training data contains 2,698 questions, and the test set has 1,874 questions that are unlabeled. Each question belongs to one of the following categories:

1. Business

2. Computers

3. Entertainment

4. Music

5. Family

6. Education

7. Health

8. Science
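Before training a classifier on these eight categories, the string labels are typically mapped to integer ids. A minimal sketch of that encoding (mirroring what label encoders in most ML libraries do):

```python
# Map the eight UMICH category names to integer ids and back.
categories = ["Business", "Computers", "Entertainment", "Music",
              "Family", "Education", "Health", "Science"]
label_to_id = {name: i for i, name in enumerate(categories)}
id_to_label = {i: name for name, i in label_to_id.items()}

print(label_to_id["Music"])   # 3
print(id_to_label[7])         # Science
```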

Loading the dataset using TensorFlow

import tensorflow as tf

def UMICH(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop section-header lines of the form "= Heading ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# TextLineDataset reads local files, so first download the training data from
# https://www.kaggle.com/c/si650winter11/data
train = UMICH('training.txt')

ARC

The ARC dataset comprises 7,787 multiple-choice science questions, intended to enable the targeted pairing of questions with specific problem solvers. It contains 400 detailed question classes and problem domains for these science exam questions, developed on the basis of exam syllabi, study guides, and a detailed data analysis of the ARC questions. The question classification scheme was developed by the researchers Dongfang Xu, Peter Jansen, and Jaycie Martin.
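Since the ARC files on Kaggle are distributed as CSV, they can also be read directly with Python's standard csv module. This is a hedged sketch: the column names and the sample row below are illustrative, not the dataset's exact schema.

```python
import csv
import io

# Illustrative ARC-style CSV; real column names may differ from these.
sample = io.StringIO(
    "questionID,question,AnswerKey\n"
    "q1,Which gas do plants absorb from the atmosphere?,A\n"
)

# DictReader yields one dict per question row, keyed by the header line.
questions = list(csv.DictReader(sample))
print(questions[0]["question"])
```

For a real run, replace the in-memory sample with `open('ARC-Challenge-Dev.csv')` after downloading the file.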

Loading the dataset using TensorFlow

import tensorflow as tf

def ARC(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop section-header lines of the form "= Heading ="
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# TextLineDataset reads local files; first download the CSV from
# https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge
train = ARC('ARC-Challenge-Dev.csv')

Conclusion

In this article, we have discussed some of the most popular datasets used in question classification. Further, we loaded these text corpora using PyTorch and TensorFlow. These datasets feature a diverse range of questions. As question classification is a critical component of question answering, various deep learning models can be implemented on them to achieve high accuracy.


Ankit Das

A data analyst with expertise in statistical analysis and data visualization, ready to serve the industry using various analytical platforms. I look forward to gaining in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.