
How Google’s TyDi QA Has Made It Easier For ML Systems To Answer Multilingual Questions


Today’s question-answering technologies can handle straightforward questions phrased in simple vocabulary, where a single word does not take on different forms to express its meaning. However, there are thousands of languages in existence, and many of them construct the meaning of a word in very different ways, which makes it difficult for machine learning systems to understand how these languages express meaning. Google has taken up this challenge by creating TyDi QA, a corpus that helps machine learning systems train better at answering questions across a diverse set of languages.

The Problem With Multilingual Questions

Google illustrates what it means for languages to have ‘different approaches to constructing meaning’ with an example from Arabic.

In English, the word for a single object, say a book, is ‘book’. For more than one such object, the word changes to ‘books’.

Arabic, however, has a separate form for exactly two objects: two books are ‘kitaban’, rather than the plural ‘kutub’. The TyDi QA corpus tries to tackle this problem of languages expressing meaning differently when it comes to answering multilingual questions.

Here’s Where TyDi QA Comes Into The Picture

Google’s TyDi QA corpus covers 11 typologically diverse languages and includes over 200,000 question-answer pairs. Some of these languages, such as Arabic, Bengali, Korean, Russian, Telugu and Thai, use non-Latin alphabets, while others, including Arabic, Finnish, Indonesian and Kiswahili, form words in complex ways. Many of these languages also have limited data available on the web. Google introduced the corpus to address these problems, in the belief that a system which can handle these challenges will also work well for many other languages.

The Science Behind Creating An Accurate Dataset Like TyDi QA

Google describes TyDi QA as a benchmark for multilingual question answering. To construct a dataset that is both accurate and natural, Google departed from the traditional way of collecting QA data.

Earlier, researchers created QA datasets by asking people to read a paragraph and write questions that could be answered from within the paragraph they were shown.

As a result, the questions these people came up with reused words from the paragraph they had just read. And because the answers had to come from the same paragraph, many words from the question appeared in the answer too.

This repetition produced data that nudged machine learning algorithms towards simple word matching rather than answering the questions in a more meaningful way.
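The word-matching problem described above can be made concrete with a small overlap measure. The following is a minimal sketch, not anything from Google's tooling: the function names, the example passage and both questions are made up for illustration. It computes the fraction of a question's words that also appear in the passage.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def question_overlap(question: str, passage: str) -> float:
    """Fraction of distinct question words that also occur in the passage."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(passage)) / len(q)


# Made-up passage and questions, for illustration only.
passage = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
# A question written while looking at the passage tends to reuse its words.
question_seen = "When was the Eiffel Tower completed?"
# A question written without seeing the answer shares far fewer words.
question_unseen = "Who designed the famous iron landmark in Paris?"

overlap_seen = question_overlap(question_seen, passage)
overlap_unseen = question_overlap(question_unseen, passage)
print(f"seen: {overlap_seen:.2f}, unseen: {overlap_unseen:.2f}")
```

When most of a question's words can be found in the passage, a system can score well by matching strings alone, which is the shortcut the older collection method encouraged.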

Here, Google relied on human curiosity to create a more natural dataset. It collected questions from people who wanted an answer but did not know it yet. To prompt such questions, a passage has to be interesting enough to spark a reader’s curiosity, so Google showed people interesting passages from Wikipedia written in their native language. These people were then asked to write any question that was not answered by the passage and that they actually wanted to know the answer to.

For each of these questions, a Google search was performed to find the best-matching Wikipedia article in the relevant language. A person was then asked to find and highlight the answer within the article chosen for the question.

However, when one carries out such research, some deviation is expected. While Google did expect some gap between the wording of the questions and the answers, the result turned out to be much more complicated.

They found that in some languages, the same words took different forms in the questions and in the answers. Google demonstrated this with an example from Finnish:

The words for day and week appeared in different forms in the question and in the answer. A system needs to be able to recognise the relationship among Finnish words like viikonpäivät, seitsenpäiväinen, and viikko.
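To see why exact word matching breaks down here, the three Finnish words can be compared directly. The sketch below uses character trigram overlap, a crude stand-in chosen only for illustration; it is not the method Google or any TyDi QA system uses, and the helper names are hypothetical.

```python
from __future__ import annotations


def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def shared_ngrams(a: str, b: str, n: int = 3) -> set[str]:
    """Character n-grams that two words have in common."""
    return char_ngrams(a, n) & char_ngrams(b, n)


words = ["viikonpäivät", "seitsenpäiväinen", "viikko"]

# Exact string comparison finds no match between any pair of words.
any_exact_match = len(set(words)) < len(words)

# Sub-word overlap exposes the shared stems that whole-word matching misses.
week_overlap = shared_ngrams("viikonpäivät", "viikko")
day_overlap = shared_ngrams("viikonpäivät", "seitsenpäiväinen")

print(any_exact_match)
print(sorted(week_overlap))
print(sorted(day_overlap))
```

Trigram overlap only hints at the relationship; a real question-answering model has to learn such connections from data, which is part of what makes these languages challenging.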

Outlook

To push the limits of question-answering systems for users around the world, Google has created a TyDi QA leaderboard on which participants can evaluate their machine learning systems. As developers work through these challenges and compete for higher rankings, the TyDi QA dataset is likely to see wider use, which should in turn improve the quality of QA systems around the world. Google has also open-sourced a question answering system that uses this data.


Sameer Balaganur

Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.