Question Answering is becoming a more prominent research field whose aim is to provide more natural access to textual information than traditional document retrieval techniques. More specifically, a Question Answering (QA) system is a kind of search engine that responds to natural language questions with concise and precise answers. For example, given the question ‘Where is Mount Everest located?’, the QA system should respond ‘Between Nepal and Tibet’ instead of returning a list of documents about the Himalayas.
Due to the explosion of the internet and the existence of several multicultural communities, one of the major challenges faced by these systems is multilinguality. In a multilingual scenario, a QA system is expected to answer questions formulated in several languages and to look for answers in several collections in different languages. Two kinds of QA system manage information in different languages: cross-lingual QA systems and multilingual QA systems. The first addresses the situation where questions are formulated in a language different from that of the document collection. The second performs a search over two or more document collections in different languages.
Both kinds of systems have some advantages over standard monolingual QA systems: they mainly allow users to access more information more easily than a monolingual system. Generally, a multilingual QA system can be described as an ensemble of several monolingual systems, where each one works on a different monolingual document collection. Under this schema, two additional tasks are required: first, the translation of incoming questions into all target languages, and second, the combination of the relevant information extracted from the different languages.
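This ensemble schema can be sketched in a few lines of Python. Everything here is a hypothetical placeholder (the `translate` and `monolingual_qa` helpers are stubs, not part of the article's code); it only illustrates the translate-then-combine pipeline described above.

```python
# Sketch of a multilingual QA ensemble: translate the question into each
# target language, query that language's monolingual QA system, then
# merge the candidate answers by score. All helpers are stub placeholders.

def translate(question, lang):
    # Placeholder: a real system would call a machine-translation model.
    return f"[{lang}] {question}"

def monolingual_qa(question, lang):
    # Placeholder: a real system would search the lang-specific collection
    # and return (answer, confidence) pairs.
    return [(f"answer from {lang} collection", 0.5)]

def multilingual_qa(question, target_langs):
    candidates = []
    for lang in target_langs:
        translated = translate(question, lang)
        candidates.extend(monolingual_qa(translated, lang))
    # Combination step: keep the best-scored answers across languages.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

answers = multilingual_qa("Where is Mount Everest located?", ["en", "fr", "zh"])
```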
In this article, we will discuss the multilingual universal sentence encoder developed by researchers at Google AI, which is trained on 16 languages, including Arabic, Chinese, English and French. This module gives strong performance on cross-lingual question-answer retrieval.
The module exposes two signatures: the ‘question_encoder’, which encodes variable-length questions in any of the languages mentioned above, and the ‘response_encoder’, which encodes candidate answers. The response encoder takes two inputs: the actual answer text and the context surrounding that answer.
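Because both signatures map text into the same embedding space, retrieval reduces to scoring a question embedding against precomputed response embeddings, typically by dot product. A toy NumPy sketch of that scoring step (the vectors here are made-up stand-ins, not real model outputs):

```python
import numpy as np

# Toy stand-ins for encoder outputs: one question vector and three
# response vectors living in the same (here 4-dimensional) space.
question_vec = np.array([0.1, 0.9, 0.0, 0.2])
response_vecs = np.array([
    [0.0, 1.0, 0.1, 0.1],   # close to the question vector
    [1.0, 0.0, 0.0, 0.0],   # unrelated
    [0.2, 0.5, 0.5, 0.5],   # partial match
])

# Dot-product scores; the highest-scoring response is the retrieved answer.
scores = response_vecs @ question_vec
best = int(np.argmax(scores))
```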
Let’s implement the QA system.
Code Implementation: Question-Answer Retrieval
Install and import all dependencies:
!pip install -q nltk
!pip install -q tqdm
!pip install -q simpleneighbors[annoy]
!pip install -q tensorflow_text
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from tqdm.notebook import tqdm
from IPython.display import display, HTML

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')
Load the model:
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3'
model = hub.load(model_url)
The helper functions below are user-defined functions to download the dataset, extract questions and answers from it, and visualize the retrieval results.
def squad_download(url):
  return json.load(urllib.request.urlopen(url))

def sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences))  # remove duplicates

def questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>' + text[i:i + len(highlight)] + '</b>'
    text = text[i + len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p> <b>%s</b></p>
    <p>Answer:</p>
    <p> <b>%s</b></p>
    ''' % (query_text, answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p> <b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences:
    <ol>
  '''
  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'
  result_md += "</ol>"
  display(HTML(result_md))
SQuAD dataset is used for the demo. It is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
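The extraction helpers above traverse SQuAD's nested JSON layout: `data` → `paragraphs` → `context` plus `qas` → `question`/`answers`. A heavily abbreviated sketch of that structure (the field names match the SQuAD v2.0 format; the sample values are illustrative):

```python
# Abbreviated sketch of the SQuAD v2.0 JSON layout the helpers traverse.
squad_like = {
    "data": [
        {
            "title": "Mount_Everest",
            "paragraphs": [
                {
                    "context": "Mount Everest lies on the border between Nepal and Tibet.",
                    "qas": [
                        {
                            "question": "Where is Mount Everest located?",
                            # 'answers' is a list; the code above takes entry [0].
                            "answers": [{"text": "between Nepal and Tibet",
                                         "answer_start": 33}],
                        }
                    ],
                }
            ],
        }
    ]
}

first_answer = squad_like["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]["text"]
```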
Here we will download the data from the SQuAD official page, which is in JSON format; then we will extract the questions, the answers and the context of each answer, as below:
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'

squad_json = squad_download(squad_url)
sentences_ = sentences_from_squad_json(squad_json)
questions = questions_from_squad_json(squad_json)
print(len(sentences_), 'sentences and', len(questions), 'questions are extracted from SQuAD json', squad_url)

print('\nExample of sentence and context:\n')
sentence = random.choice(sentences_)
print('Sentence:\n')
pprint.pprint(sentence[0])
print('\nContext:\n')
pprint.pprint(sentence[1])

print('\nExample of Q&A:\n')
question = random.choice(questions)
print('Question:\n')
pprint.pprint(question[0])
print('\nAnswer:\n')
pprint.pprint(question[1])
Infer the model:
To perform the semantic search, we use SimpleNeighbors, a wrapper around the Annoy library, which computes approximate nearest neighbours: given any item in the dataset, or an arbitrary point in the embedding space, it finds the items closest to it.
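What the Annoy-backed index approximates is an exact angular (cosine-distance) nearest-neighbour search. For intuition, here is a brute-force NumPy version of the same idea (the items and vectors are toy data, not real embeddings):

```python
import numpy as np

def nearest_angular(query, items, n=1):
    """Exact angular nearest neighbours: highest cosine similarity first."""
    vecs = np.array([v for _, v in items], dtype=float)
    q = np.asarray(query, dtype=float)
    # Cosine similarity; angular distance is monotone in negative similarity,
    # so ranking by similarity gives the same order as ranking by distance.
    sims = (vecs @ q) / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:n]
    return [items[i][0] for i in order]

items = [
    ("himalaya", [0.9, 0.1]),
    ("ocean",    [0.1, 0.9]),
    ("mountain", [0.8, 0.3]),
]
result = nearest_angular([1.0, 0.2], items, n=2)
```

An Annoy-based index trades the exactness of this scan for sub-linear query time, which is what makes it practical for thousands of sentence embeddings.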
batch_size = 120

# Encode one (sentence, context) pair just to get the embedding dimension.
encodings = model.signatures['response_encoder'](
    input=tf.constant([sentences_[0][0]]),
    context=tf.constant([sentences_[0][1]]))
index = simpleneighbors.SimpleNeighbors(len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for', len(sentences_), 'sentences')
slices = zip(*(iter(sentences_),) * batch_size)
num_batches = int(len(sentences_) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
      input=tf.constant(response_batch),
      context=tf.constant(context_batch))
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])
index.build()
print('simpleneighbors index for', len(sentences_), 'sentences is built')
Test the model:
num_results = 10

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])
This is all about the Question Answering system based on the multilingual universal sentence encoder. We have seen how QA systems are structured and their types; with such tools, we can build fully interactive NLP applications. Moreover, you can take another dataset from the SQuAD platform and perform the same task, or use your own dataset.
- Link for Colab Notebook
- Official Implementation
- Multilingual sentence representation
- Cross lingual sentence representation