Conversational agents are dialogue systems that use NLP to respond to queries posed in human language. They leverage advanced deep learning and natural language understanding to move beyond canned chatbot replies toward more contextual responses. Conversational AI encompasses three main areas of artificial intelligence research — automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS, or speech synthesis). These dialogue systems read from an input channel and reply with a relevant response via an output channel, whether in graphics, speech, or haptic-assisted physical gestures.
Modern conversational models often struggle when confronted with temporal relationships or disfluencies. The temporal reasoning capability of massive pre-trained language models such as T5 and GPT-3 in dialogs is still largely under-explored, and progress on improving their performance has been slow, in part because of the lack of datasets that capture these conversational and speech phenomena. To address this gap, Google has introduced two new datasets for conversational NLP.
Google’s published study investigates the temporal reasoning capabilities of pre-trained language models in dialogs using TimeDial and Disfl-QA, which target temporal commonsense reasoning in dialogs and the understanding of contextual disfluencies, respectively. Both are benchmark datasets that demonstrate the gap between human performance and current state-of-the-art NLP models.
TimeDial targets temporal reasoning in conversation, such as the duration, frequency, or relative ordering of events in a dialog. Current NLP models tend to choose poorly when asked to complete fill-in-the-blank questions that demand basic temporal reasoning or an understanding of temporal concepts. TimeDial therefore introduces a multiple-choice span-filling task targeted at temporal understanding.
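A span-filling instance of this kind can be sketched as below. The dialog, options, and field names here are illustrative inventions, not the dataset's exact schema or contents:

```python
# Illustrative sketch of a TimeDial-style span-filling instance.
# Field names and values are hypothetical, not the released schema.
instance = {
    "dialog": [
        "A: May we see the wine list, please?",
        "B: Sure. We will be closing in <MASK>, so please order soon.",
    ],
    # Four candidate spans; typically more than one is temporally plausible.
    "options": ["half an hour", "twenty minutes", "two seconds", "four days"],
    "correct": {"half an hour", "twenty minutes"},
}

def is_correct(instance: dict, choice: str) -> bool:
    """Check a model's chosen span against the plausible answers."""
    return choice in instance["correct"]
```

The task is hard precisely because the distractors ("two seconds", "four days") are grammatical in the blank; only commonsense about durations rules them out.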
For instance, consider this conversation shown on the Google AI Blog.
Credit: Google AI Blog
Determining the right answer requires the NLP model to understand the temporal relationship between events — for example, that half past one comes before three o’clock, and half past three comes after both. It also demands world knowledge to determine that the individual is not late for the meeting yet. Yet current models like T5 and BERT end up picking the wrong answers.
To measure this, Google’s TimeDial benchmark tests a model’s temporal commonsense reasoning abilities within the context of dialogue through a four-option multiple-choice setup.
Google ran experiments across three modelling paradigms:
- classification over the provided four options using BERT
- mask filling for the masked span in the dialogue using BERT-MLM
- generative methods using T5
A quantitative error analysis concluded that the pre-trained language models could not truly reason over the context. Instead, they often rely on shallow and spurious features such as text matching. This calls for new ways of representing temporal objects in general text representations.
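The "shallow text matching" failure mode can be illustrated with a toy baseline that simply picks the option sharing the most surface tokens with the dialog context. This heuristic is a made-up illustration, not any of the models Google evaluated:

```python
def overlap_score(context: str, option: str) -> int:
    """Count how many of the option's tokens also appear in the context."""
    context_tokens = set(context.lower().split())
    return sum(tok in context_tokens for tok in option.lower().split())

def pick_by_overlap(context: str, options: list[str]) -> str:
    """A spurious baseline: choose the option with maximal token overlap."""
    return max(options, key=lambda opt: overlap_score(context, opt))

# The context repeats "at three", so the baseline is drawn toward the
# temporally wrong option that echoes those surface tokens.
context = "The meeting starts at three o'clock; it is half past one now."
options = ["in an hour and a half", "at three o'clock yesterday"]
```

Here `pick_by_overlap` prefers the distractor that merely repeats words from the context, which is exactly the kind of spurious cue the error analysis points to.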
The dataset is publicly available at: https://github.com/google-research-datasets/timedial.
Disfluencies occur in the text output generated by speech recognition systems, so studying disfluent text is essential to building conversational agents that understand human speech. But research in NLP faces two hurdles:
- The lack of curated datasets obstructs deeper research and model innovation, since existing datasets generally do not contain these disfluencies.
- The available datasets are limited in scale and complexity.
Together, these make it difficult for researchers to stress-test NLP models.
Google claims Disfl-QA is the first dataset containing contextual disfluencies in an information-seeking setting. It is a targeted dataset comprising roughly 12,000 questions that contain these disfluencies.
Close to 90 percent of the disfluencies in Disfl-QA are corrections or restarts, which makes it a tough test for disfluency correction. In addition, it has a broader scope of semantic distractions, i.e., distractors that carry semantic meaning, rather than simpler speech disfluencies.
Google demonstrated this with the help of an example.
Credit: Google AI Blog
In this example, Q1 asks about the location of Normandy. In the disfluent version (DQ1), however, ‘Norse’ is mentioned before the question is corrected. This correctional disfluency confuses QA models that rely on shallow textual cues to answer the question.
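A Disfl-QA-style record pairing the fluent and disfluent questions can be sketched as follows. The field names and the exact disfluent wording are illustrative, not the released format:

```python
# Illustrative Disfl-QA-style record (schema and wording are a sketch,
# not the exact released format). The disfluent version utters "Norse"
# first, then repairs it to "Normandy".
record = {
    "original_question": "In what country is Normandy located?",
    "disfluent_question": "In what country is Norse, no wait, Normandy located?",
    "answer": "France",
}

def has_repair_cue(disfluent: str, cue: str = "no wait") -> bool:
    """Detect an explicit repair cue -- a shallow check, not real disfluency parsing."""
    return cue in disfluent
```

A model anchored on surface cues may latch onto "Norse" and answer about Norse origins, while the intended answer concerns Normandy.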
According to the experimental results, existing language models perform poorly when tested on Disfl-QA. Data augmentation methods can partially recover this loss in performance. The researchers also found that large-scale disfluency datasets are needed for NLP models to become robust to disfluencies.
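One simple flavour of such augmentation is synthesizing disfluent variants of clean questions by injecting a fabricated self-correction. The cue words and insertion strategy below are invented for illustration, not the paper's actual augmentation method:

```python
import random

def inject_correction(question: str, distractor: str, rng: random.Random) -> str:
    """Insert a fabricated self-correction ("<distractor>, no sorry,")
    before a randomly chosen word of an otherwise fluent question."""
    words = question.split()
    i = rng.randrange(len(words))
    repaired = words[:i] + [distractor + ",", "no", "sorry,"] + words[i:]
    return " ".join(repaired)

rng = random.Random(0)  # seeded for reproducibility
augmented = inject_correction("Where is Normandy located?", "Norse", rng)
```

Training on such synthetic pairs (disfluent input, original answer) is one way to expose a QA model to corrections it would otherwise never see.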
The dataset is publicly available at: https://github.com/google-research-datasets/disfl-qa.