Today’s question answering technologies can handle straightforward questions phrased in simple vocabulary, where no word has to be interpreted in more than one way. However, with thousands of languages in existence, many of which build the meaning of a word in different ways, it becomes difficult for machine learning systems to understand how these languages express meaning. Google has taken up this challenge by creating TyDi QA, a corpus that helps machine learning systems train better at answering questions across a diverse set of languages.
The Problem With Multilingual Questions
Google illustrates what it means for languages to have ‘different approaches to construct a meaning’ with the example of Arabic.
In English, the word for a single object, say a book, is ‘book’. When referring to more than one such object, the word changes to ‘books’.
In Arabic, however, if there are exactly two objects, the dual form ‘kitaban’ is used instead of the plural ‘kutub’. The TyDi QA corpus tries to tackle this problem of languages expressing meaning differently when it comes to answering multilingual questions.
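The singular, dual and plural distinction described above can be sketched as a small lookup. This is only a toy illustration: the `FORMS` table, the function names and the simplified Latin transliterations (including the singular ‘kitab’) are assumptions made for this example, and the sketch ignores case endings and the agreement rules that apply when a numeral accompanies the noun.

```python
# Toy lexicon (hypothetical, for illustration only).
# Arabic forms are simplified Latin transliterations.
FORMS = {
    "english": {"singular": "book", "plural": "books"},
    "arabic": {"singular": "kitab", "dual": "kitaban", "plural": "kutub"},
}


def number_category(count: int, language: str) -> str:
    """Pick the grammatical number a language uses for `count` objects."""
    if count == 1:
        return "singular"
    # Only languages that list a dual form in the table use it for two objects.
    if count == 2 and "dual" in FORMS[language]:
        return "dual"
    return "plural"


def noun_form(count: int, language: str) -> str:
    """Return the word form for `count` books in the given language."""
    return FORMS[language][number_category(count, language)]


for count in (1, 2, 3):
    print(count, noun_form(count, "english"), noun_form(count, "arabic"))
```

A system that only knows a singular and a plural form has no way to connect ‘kitaban’ to ‘kutub’, which is the kind of gap the example points at.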
Here’s Where TyDi QA Comes Into The Picture
Google’s TyDi QA corpus covers 11 typologically diverse languages and includes over 200,000 question-answer pairs from them. Many of these languages use non-Latin alphabets, such as Arabic, Bengali, Korean, Russian, Telugu and Thai, while others, including Finnish, Indonesian, Kiswahili and Arabic, form words in complex ways. Several of these languages also have little data available on the web. Google introduced this corpus to address these problems, as it believes that a system which can handle these challenges will be successful for many other languages too.
The Science Behind Creating An Accurate Dataset Like TyDi QA
Google calls TyDi QA a benchmark for multilingual question answering. To construct a dataset that is accurate and natural, it took a slightly different approach to data collection than the traditional one.
Earlier, researchers created QA datasets by asking people to read a paragraph and write questions that could be answered from within the paragraph they were shown.
As a result, the questions people came up with often reused words from the paragraph they had just read. And because the answers had to come from the same paragraph, words from the question often reappeared in the answer too.
This repetition produced data that nudged machine learning algorithms towards word matching rather than answering the questions in a more meaningful way.
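The word-matching bias described above can be made concrete with a minimal sketch that measures how many of a question’s words also appear in the passage. The function names and the example passage and questions are invented for illustration and are not taken from TyDi QA or from Google’s tooling.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def question_overlap(question: str, passage: str) -> float:
    """Fraction of distinct question tokens that also occur in the passage."""
    question_tokens = token_set(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & token_set(passage)) / len(question_tokens)


# Hypothetical example data.
passage = "The Eiffel Tower was completed in 1889 for the World's Fair in Paris."
# Written while looking at the passage: reuses its wording.
seen = "When was the Eiffel Tower completed?"
# Written without seeing the passage: different wording, same information need.
unseen = "Which year did construction of the famous Paris landmark finish?"

print(round(question_overlap(seen, passage), 2))
print(round(question_overlap(unseen, passage), 2))
```

A model trained mostly on questions like the first can score well by matching words, while questions like the second require it to relate differently worded text.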
Here, Google followed human curiosity to create a more natural dataset: it collected questions from people who wanted an answer but did not yet know it. To elicit such questions, a passage must be interesting enough to spark a reader’s curiosity. Google showed people interesting passages from Wikipedia written in their native language. They were then asked to pose any question that the passage did not answer and whose answer they actually wanted to know.
For each of these questions, a Google search was performed to find the appropriate Wikipedia article in the relevant language. A person was then asked to find and highlight the answer within the article chosen for the question.
Because the questions were written without seeing the answers, Google expected some gap between the wording of questions and answers, but the result was more complicated than anticipated.
It found that in some languages, related words take different forms in the question and in the answer. Google demonstrated this with an example from Finnish:
The words for day and week are represented differently in the question and the answer (figure below). A system needs to be able to recognise the relationship among Finnish words like viikonpäivät, seitsenpäiväinen, and viikko.
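To see why these three words are hard to relate, here is a rough sketch that compares them by exact match and by shared character trigrams. The helper names are made up for this example, the glosses in the comments are approximate, and trigram overlap is only a crude stand-in for real morphological analysis, not how any TyDi QA system works.

```python
from itertools import combinations


def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in a lowercased word."""
    lowered = word.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def shared_ngrams(a: str, b: str, n: int = 3) -> set[str]:
    """Character n-grams that two words have in common."""
    return char_ngrams(a, n) & char_ngrams(b, n)


# Approximate glosses: days of the week, seven-day, week.
words = ["viikonpäivät", "seitsenpäiväinen", "viikko"]

for a, b in combinations(words, 2):
    common = sorted(shared_ngrams(a, b))
    print(f"{a} / {b}: exact match={a == b}, shared trigrams={common}")
```

No pair matches exactly, and viikko and seitsenpäiväinen share no trigram at all even though both relate to a week, so surface matching alone cannot link them.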
To push the limits of question-answering systems for users around the world, Google has created a TyDi QA leaderboard on which participants can evaluate their machine learning systems. As the TyDi QA dataset is used more widely, developers who work through its challenges to rank higher should improve the quality of QA systems worldwide. Google has also open-sourced a question answering system that uses this data.
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.