MITB Banner

10 Question-Answering Datasets To Build Robust Chatbot Systems

Share

One of the ways to build a robust and intelligent chatbot system is to feed question answering dataset during training the model. Question answering systems provide real-time answers that are essential and can be said as an important ability for understanding and reasoning.

In this article, we list down 10 Question-Answering datasets which can be used to build a robust chatbot.

1| SQuAD

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset which includes questions posed by crowd-workers on a set of Wikipedia articles and the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. The dataset was presented by researchers at Stanford University and SQuAD 2.0 contains more than 100,000 questions.

Get the dataset here

2| Natural Questions (NQ)

 Natural Questions (NQ) is a new, large-scale corpus for training and evaluating open-domain question answering systems. Presented by Google, this dataset is the first to replicate the end-to-end process in which people find answers to questions. It contains 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, to be used in training QA systems. Furthermore, researchers added 16,000 examples where answers (to the same questions) are provided by 5 different annotators which will be useful for evaluating the performance of the learned QA systems.   

Get the dataset here.

3| Question Answering in Context

Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information seeking dialog. In this dataset, instances consist of an interactive dialogue between two crowd workers which is a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and a teacher who answers the questions by providing short excerpts (spans) from the text. It contains 14K

information-seeking QA dialogs which include 100K QA pairs in total.

Get the dataset here.

4| Conversational Question Answering (Coca)

Conversational Question Answering (CoQA), pronounced as Coca is a large-scale dataset for building conversational question answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. The dataset contains 127,000+ questions with answers collected from 8000+ conversations. 

Get the dataset here.

5| HOTPOTQA

HOTPOTQA is a dataset which contains 113k Wikipedia-based question-answer pairs with four key features. These are questions that require finding and reasoning over multiple supporting documents to answer, the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas, sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain predictions and a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.

Get the dataset here.

6| ELI5  

ELI5 (Explain Like I’m Five) is a longform question answering dataset. It is a large-scale, high-quality data set, together with web documents, as well as two pre-trained models. The dataset is created by Facebook and it comprises of 270K threads of diverse, open-ended questions that require multi-sentence answers.

Get the dataset here.

7| ShARC

Shaping Answers with Rules through Conversations (ShARC) is a QA dataset which requires logical reasoning, elements of entailment/NLI and natural language generation. The dataset consists of  32k task instances based on real-world rules and crowd-generated questions and scenarios. 

Get the dataset here.

8| MS MARCO

MS MARCO or Human Generated MAchine Reading COmprehension Dataset is a large-scale dataset created by Microsoft AI & Research. The dataset consists of 1,010,916 anonymized question which is sampled from Bing’s search query logs, each with a human-generated answer and 182,669 completely human rewritten generated answers. This dataset is mainly intended for non-commercial research purposes only to promote advancement in the field of artificial intelligence and related areas

Get the dataset here.

9| TWEETQA

TWEETQA is a social media-focused question answering dataset. This dataset is created by the researchers at IBM and the University of California and can be viewed as the first large-scale dataset for QA over social media data. The dataset now includes 10,898 articles, 17,794 tweets, and 13,757 crowdsourced question-answer pairs. 

Get the dataset here.

10| NEWSQA

NewsQA is a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. The dataset is collected from crowd-workers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. The dataset contains 119,633 natural language questions posed by crowd-workers on 12,744 news articles from CNN.

Get the dataset here.

Share
Picture of Ambika Choudhury

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Subscribe to Our Newsletter

The Belamy, our weekly Newsletter is a rage. Just enter your email below.