Questions Classification assumes a significant part in question answering systems, with one of the most important steps in the enhancement of the classification process being the identification of question types. The main aim of question classification is to anticipate the substance kind of the appropriate response of a natural language processing. Question order is regularly done using machine learning procedures.
For example, Naive Bayes, k-Nearest Neighbors, and SVM calculation can be utilized to actualize the question classification. In some cases, Bag-of-Words and n-grams are the highlights used in the AI approach.
The recent development in deep learning has demonstrated its ability in question classification. The CNN architecture models are equipped for extricating the elevated level highlights from the local text by window filters. Distinctive lexical, grammatical, and semantic highlights can be extracted from a question.
Subscribe to our Newsletter
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.
Here, we will discuss some of the popular datasets and their code implementation using TensorFlow and Pytorch. Further, we will discuss some of the benchmark models that gave high accuracy on these datasets.
Trec
The TREC dataset is used for question characterization consisting of open-area, reality-based inquiries partitioned into wide semantic classes. It has both a six-class (TREC-6) and a fifty-class (TREC-50) adaptation. Both have 5,452 preparing models and 500 test models, yet TREC-50 has better-grained names. The Text retrieval Conference(or TREC, co-supported by the National Institute of Standards and Technology and U.S. Division of Defense, was presented in 1992.
Loading the dataset using Pytorch
import os from torchtext import data class TREC(data.Dataset): urls = ['http://cogcomp.org/Data/QA/QC/train_5500.label', 'http://cogcomp.org/Data/QA/QC/TREC_10.label'] class_name = 'trec' directoryname = '' @staticmethod def sort_key(ex): return len(ex.text) def __init__(self, path, text_field, label_field, fine_grained=False, **kwargs): fields = [('text', text_field), ('label', label_field)] exam1 = [] def get_label_str(label): return label.split(':')[0] if not fine_grained else label label_field.preprocessing = data.Pipeline(get_label_str) for line in open(os.path.expanduser(path), 'rb'): label, _, text = line.replace(b'\xf0', b' ').decode().partition(' ') exam1.append(data.Example.fromlist([text, label], fields)) super(TREC, self).__init__(exam1, fields, **kwargs) @classmethod def splits(cls, text_field, label_field, root='.data', train='train_5500.label', test='TREC_10.label', **kwargs): return super(TREC, cls).splits( root=root, text_field=text_field, label_field=label_field, train=train, validation=None, test=test, **kwargs) @aatash-shahedvancer-inclassmethod def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs): text1 = data.Field() label1 = data.Field(sequential=False) train, test = cls.splits(text1, label1, root=root, **kwargs) text1.build_vocab(train, vectors=vectors) label1.build_vocab(train) return data.BucketIterator.splits( (train, test), batch_size=batch_size, device=device)
Loading the dataset using TensorFlow
import tensorflow as tf import tensorflow_datasets.public_api as tfds _URLs = { "train": "http://cogcomp.org/Data/QA/QC/train_5500.label", "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label", } coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"] fine_label = [ "manner","cremat","animal","exp","ind","title","date","reason","event","state","desc","count"] class Trec(tfds.core.GeneratorBasedBuilder): VERSION = tfds.core.Version("1.0.0") def _info(self): return tfds.core.DatasetInfo( builder=self, description=_DESCRIPTION, features=tfds.features.FeaturesDict({ "label-coarse": tfds.features.ClassLabel(names=coarse_label), "label-fine": tfds.features.ClassLabel(names=fine_label), "text": tfds.features.Text(), }), homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/", citation=_CITATION, )
State of the Art
The current state of the art on Trec-6 dataset is USE_T+CNN. The model gave an error of 1.93.
UMICH S1650
UMICH is an information document containing genuine inquiry by the group presented on Yahoo Answers. The training data contains 2698 questions, and the test set has 1874 inquiries that are unlabeled. The question belongs to each of the categories:
1. Business
2. Computers
3. Entertainment
4. Music
4. Family
5. Education
6. Health
7. Science
Loading the dataset using TensorFlow
import tensorflow as tf def UMICH(path): data = tf.data.TextLineDataset(path) def content_filter(source): return tf.logical_not(tf.strings.regex_full_match( source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*')) data = data.filter(content_filter) data = data.map(lambda x: tf.strings.split(x, ' . ')) data = data.unbatch() return data train= UMICH('https://www.kaggle.com/c/si650winter11/data')
ARC
ARC dataset comprises 7,787 different decision science questions to empower focused on blending of questions with explicit problem solvers. It contains 400 itemized question classes and issue spaces for these science test questions created dependent on test prospectuses, study guides, and detailed information analysis of the ARC questions. It was developed by the researchers: Dongfang Xu , Peter Jansen and Jaycie Martin.
Loading the dataset using TensorFlow
import tensorflow as tf def ARC(path): data = tf.data.TextLineDataset(path) def content_filter(source): return tf.logical_not(tf.strings.regex_full_match( source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*')) data = data.filter(content_filter) data = data.map(lambda x: tf.strings.split(x, ' . ')) data = data.unbatch() return data train= ARC('https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge/ARC-Challenge-Dev.csv')
Conclusion
In this article, we have discussed some of the most popular datasets that are used in Question Classification. Further, we implemented these text corpus using Pytorch and TensorFlow. These datasets feature a diverse range of questions. As question classification is a critical criterion in the question-answering field, we can further implement various deep learning models to get high accuracy.