
Google Sets A New Benchmark For Question Answering Research With A QA System

“The black cat has crossed the road.” This sentence might sound simple, but when you try to translate it into other languages, all the historical inferences and semantic subtleties come into play. For a cat lover, the sentence might bring to mind fluffy cat pictures, and the crossing of the road is unimportant. For some it could be an ominous warning, and for others a sign of prosperity. A harmless statement like this can throw amateur linguists into disarray.



Questions such as these regarding the structure of language have been persistent for quite some time. With the advent of machine learning, NLP tasks are hotter than ever.

Our understanding of these nuances, or the lack of it, matters little in task-specific machine translation. At a more general level, however, things get tricky when machines are asked to respond to a French query in Hindi, or to a question containing a pun that relies on a localised cultural reference.

Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU).

AI researchers and linguists have been collaborating to figure out a way to supplement the pursuit of General AI with a structure, universal at its core and flexible in its deployment.

A paper by Tom Kwiatkowski et al. at Google introduces Natural Questions (NQ), a new dataset for QA research, along with methods for QA system evaluation.

In contrast to tasks where it is relatively easy to gather naturally occurring examples, defining a suitable QA task and developing a methodology for annotation and evaluation are challenging.

Modelling An Annotator

Annotation decision process, via the paper by Tom Kwiatkowski et al.

When an annotator is given a question and a Wikipedia page, they return a long answer, typically a paragraph from the page, and also a short answer such as an entity or a yes or no.

An example from the corpus looks like this:

Question: can you make and receive calls on airplane mode

Wikipedia Page: Airplane mode

Long answer: Airplane mode, aeroplane mode, flight mode, offline mode, or standalone mode is a setting available on many smartphones, portable computers, and other electronic devices that, when activated, suspends radio-frequency signal transmission by the device, thereby disabling Bluetooth, telephony, and Wi-Fi. GPS may or may not be disabled, because it does not involve transmitting radio waves.

Short answer: BOOLEAN:NO

The question seeks factual information; the Wikipedia page may or may not contain the information required to answer the question; the long answer is a bounding box on this page containing all information required to infer the answer; and the short answer is one or more entities that give a short answer to the question, or a boolean ‘yes’ or ‘no’. Both the long and short answer can be NULL if no viable candidates exist on the Wikipedia page.
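The four-part structure described above can be sketched as a small record type. This is only an illustration, not the dataset's actual schema: the class name `NQExample` and its field names are assumptions made here, and `None` stands in for NULL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NQExample:
    """Hypothetical record mirroring the (question, page, long, short) description."""
    question: str
    wikipedia_page: str
    long_answer: Optional[str] = None  # None plays the role of NULL
    short_answers: List[str] = field(default_factory=list)  # entity answers, if any
    yes_no_answer: Optional[bool] = None  # set only for boolean short answers

    def has_long_answer(self) -> bool:
        return self.long_answer is not None

    def has_short_answer(self) -> bool:
        # A short answer is either one or more entities or a boolean.
        return bool(self.short_answers) or self.yes_no_answer is not None


# The airplane mode example from the text (long answer shortened here for brevity).
example = NQExample(
    question="can you make and receive calls on airplane mode",
    wikipedia_page="Airplane mode",
    long_answer="Airplane mode, aeroplane mode, flight mode, offline mode, "
                "or standalone mode is a setting available on many smartphones",
    yes_no_answer=False,  # corresponds to BOOLEAN:NO
)

# A question whose page contains no viable candidates has both answers NULL.
unanswerable = NQExample(question="some question", wikipedia_page="Some page")
```

Keeping the boolean answer separate from the entity list makes it easy to tell a "no" answer apart from a missing one.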

The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are “natural”, in that they represent real queries from people seeking information. The corpus contains 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and 7,842 5-way annotated items sequestered as test data.

According to the paper, the long and short answer annotations are of high quality, with 90% and 84% precision respectively.

One clear finding in NQ is that for naturally occurring questions there is often genuine ambiguity in whether or not an answer is acceptable.

The Rationale Behind This Model

The researchers tried multiple annotation approaches to make the model more robust. In one example, 25 annotators were given the question ‘where is blood pumped after it leaves the right ventricle’. Of the 25 responses, 11 were correct answers and 14 contained sub-strings linking to ‘lungs’.

The idea here is to identify popular long answers, on the assumption that it is very rare for a question to have more than 3 distinct long answers annotated.

If at least 2 out of 5 annotators have given a non-null long answer on the example, then the system is required to output a non-null answer that is seen at least once in the 5 annotations; conversely if fewer than 2 annotators give a non-null long answer, the system is required to return NULL as its output.
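The 2-out-of-5 rule above can be written down as a short check. This is a minimal sketch under stated assumptions, not the official evaluation script: the function names are made up here, and long answers are represented by hypothetical string identifiers such as `"para_3"`, with `None` standing in for NULL.

```python
from typing import List, Optional, Set


def acceptable_long_answers(
    annotations: List[Optional[str]], threshold: int = 2
) -> Optional[Set[str]]:
    """Return the long answers a system may output, or None if it must output NULL."""
    non_null = [a for a in annotations if a is not None]
    if len(non_null) >= threshold:
        # Any non-null answer seen at least once in the annotations is accepted.
        return set(non_null)
    return None


def is_correct(prediction: Optional[str], annotations: List[Optional[str]]) -> bool:
    """Judge a predicted long answer against the 5-way annotations."""
    acceptable = acceptable_long_answers(annotations)
    if acceptable is None:
        return prediction is None
    return prediction in acceptable


# Two of five annotators gave a non-null long answer: a non-null answer is required.
answerable = ["para_3", None, "para_3", None, None]
# Only one annotator gave a non-null long answer: NULL is required.
unanswered = ["para_1", None, None, None, None]

print(is_correct("para_3", answerable))
print(is_correct(None, unanswered))
```

Under this rule a single annotator's answer is not enough to make a question answerable, which guards against one-off annotation noise.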

Key Takeaways

The goal of this research was to:

  1. provide large-scale end-to-end training data for the QA problem.
  2. provide a dataset that drives research in natural language understanding.
  3. study human performance in providing QA annotations for naturally occurring questions.

This is the first large publicly available dataset to pair real user queries with high-quality annotations of answers in documents. The paper also presents the metrics to be used with NQ for evaluating the performance of question answering systems. The researchers at Google demonstrate a high upper bound on these metrics and show that existing methods do not approach this upper bound.

This paper certainly pushes the boundaries of the vast field of natural language understanding, challenging pre-existing models in an attempt to realise the goal of large-scale deployment of more efficient AI platforms.



Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
