When it comes to handling a difficult situation and giving advice through language, humans are natural experts. Still, a gap remains between how humans understand and use language and what natural language processing benchmarks measure. To bridge this gap, a team of researchers led by Rowan Zellers from Seattle has published a new paper introducing an AI challenge named ‘TuringAdvice’. The challenge focuses on building language models that can generate helpful, logical advice for people describing real-world situations.
The authors say that when we use language in the real world to communicate with each other, such as when we give advice or teach someone a concept, there is rarely a universally correct answer to compare with, just a loose goal we want to achieve. They introduce a framework to narrow this gap between benchmarks and real-world language use. The TuringAdvice challenge was created by the University of Washington and the Allen Institute for AI, and a detailed research paper titled ‘Evaluating Machines by their Real-World Language Use’ was released last week.
The challenge is built on RedditAdvice, a dataset and leaderboard for measuring progress. Designed specifically for this challenge, RedditAdvice is a crowd-sourced dataset of the most-upvoted advice collected from Reddit subcommunities over a two-week period. The sole objective of the challenge is for a machine to write advice that is judged at least as helpful as a piece of human advice.
The authors write that there is a deep underlying issue: a gap between how humans use language in the real world and what our evaluation methodology can measure. Today’s dominant paradigm is to study static datasets and to grade machines by the similarity of their output to predefined correct answers.
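To make the static paradigm concrete, here is a minimal sketch of similarity-based grading. It is an illustration only: the token-overlap F1 metric, the function name token_f1, and the example sentences are assumptions chosen for this sketch and are not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model output and one predefined answer.

    Hypothetical helper for illustration; not the paper's evaluation method.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared by both texts (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one "correct" reference answer and two candidate replies.
reference = "talk to your landlord before signing the lease"
copy_score = token_f1("talk to your landlord before signing the lease", reference)
different_score = token_f1("ask a tenant union to review the contract first", reference)

print(f"verbatim copy: {copy_score:.2f}")
print(f"different but plausible advice: {different_score:.2f}")
```

In this sketch the second reply may be perfectly reasonable advice, yet it scores low simply because it shares few words with the reference. That is the kind of mismatch the authors describe between similarity-based grading and real-world language use.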
The team also released a static dataset, RedditAdvice 2019, as part of the TuringAdvice launch. It consists of 616,000 pieces of advice covering 188,000 situations discussed by people on Reddit. An initial analysis points out that models such as Google’s T5, which was introduced in 2019, are capable of writing advice, but that advice was found helpful in only 9% of cases. The team also evaluated different versions of the Grover Transformer model and a TF-IDF baseline, but left out bidirectional NLP models like Google’s BERT because they are considered worse at generating text than left-to-right models.
Further progress on the TuringAdvice challenge could lead to AI that gives advice to humans and even acts as a virtual therapist. To ensure that results reflect real-world language use, the team chose advice as their testing ground, since people are familiar with it and it overlaps with the core of various NLP tasks. They also chose it because giving advice draws on an intrinsic motivation, one that people answering questions on Reddit usually experience. Model performance was evaluated directly by human raters hired through Amazon’s Mechanical Turk. The team used Mechanical Turk because it was more ethical than simply posting automated machine advice in response to people in need.
The only concern with the TuringAdvice challenge is the cost, which stands at $370 to evaluate 200 pieces of advice on Mechanical Turk. Those who wish to participate must pay the Mechanical Turk fee to have their models evaluated, after which their results may appear on the TuringAdvice leaderboard.
The TuringAdvice challenge is the latest step in the development of natural language models. In the recent past, the University of Washington’s NLP lab, along with researchers from New York University, Facebook AI Research, and Samsung Research, created the SuperGLUE challenge and leaderboard to evaluate model performance on more complex tasks.