Guide To Question Answer Retrieval With Multilingual Universal Sentence Encoder

Question Answering (QA) is an increasingly prominent research field whose aim is to provide more natural access to textual information than traditional document retrieval techniques. More specifically, a QA system is a kind of search engine that responds to natural-language questions with concise, precise answers. For example, given the question 'Where is Mount Everest located?', a QA system should respond 'Between Nepal and Tibet' instead of returning a list of documents about the Himalayas.

Due to the explosion of the internet and the existence of many multicultural communities, one of the major challenges such systems face is multilinguality. In a multilingual scenario, a QA system is expected both to answer questions formulated in several languages and to look for answers in several collections written in different languages. Two kinds of QA systems manage information in different languages: cross-lingual QA systems and multilingual QA systems. The first addresses the situation where questions are formulated in a language different from that of a single document collection. The second performs a search over two or more document collections in different languages.


Both kinds of systems have advantages over standard monolingual QA systems: they allow users to access more information, more easily, than a monolingual system does. Generally, a multilingual QA system can be described as an ensemble of several monolingual systems, each working on a different monolingual document collection. Under this schema, two additional tasks are required: first, translation of incoming questions into all target languages, and second, combination of the relevant information extracted from the different collections.
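The ensemble schema above can be sketched in a few lines of Python. Everything here is an illustrative placeholder, not a real API: `translate` stands in for a machine-translation system, and `monolingual_search` stands in for a per-language retriever, scored by crude word overlap.

```python
# Sketch of a multilingual QA ensemble: one monolingual retriever per
# language, a (stubbed) question translator, and a score-based merge.
# All names here are illustrative placeholders, not a real API.

def translate(question, target_lang):
    # Stand-in for a real MT system; here we just tag the language.
    return f"[{target_lang}] {question}"

def monolingual_search(question, collection):
    # Stand-in retriever: score = crude word overlap with each document.
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in collection]
    return [(s, d) for s, d in scored if s > 0]

def multilingual_qa(question, collections):
    # 1) translate the question into every target language,
    # 2) run each monolingual system, 3) merge results by score.
    results = []
    for lang, collection in collections.items():
        translated = translate(question, lang)
        results.extend(monolingual_search(translated, collection))
    return [doc for score, doc in sorted(results, reverse=True)]

collections = {
    "en": ["mount everest lies between nepal and tibet"],
    "fr": ["l'everest est situé entre le népal et le tibet"],
}
print(multilingual_qa("Where is Mount Everest located?", collections))
```

A real system would replace the overlap score with embedding similarity, which is exactly what the multilingual sentence encoder below provides.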

In this article, we discuss the multilingual universal sentence encoder developed by researchers at Google AI, which is trained on 16 languages, including Arabic, Chinese, English and French. This module gives strong performance on cross-lingual question-answer retrieval.

The module exposes two signatures: a 'question_encoder', which encodes variable-length questions in any of the languages mentioned above, and a 'response_encoder', which encodes the answers. The response encoder takes two inputs: the actual answer text and the context in which that answer appears.
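The idea behind the two signatures is a dual encoder: questions and (answer, context) pairs are mapped into the same vector space, and relevance is measured by vector similarity. The 3-dimensional "embeddings" below are made up purely for illustration; the real model outputs 512-dimensional vectors.

```python
# Toy illustration of the dual-encoder idea behind the two signatures:
# a question embedding and a response embedding are compared by cosine
# similarity. The 3-d vectors here are invented for illustration only.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def score(q, a):
    return dot(q, a) / (norm(q) * norm(a))   # cosine similarity

question_vec = [0.1, 0.9, 0.2]   # pretend output of question_encoder(question)
good_answer  = [0.2, 0.8, 0.1]   # pretend output of response_encoder(answer, context)
bad_answer   = [0.9, 0.1, 0.0]

print(score(question_vec, good_answer) > score(question_vec, bad_answer))  # True
```

Because the question and response encoders share one embedding space, the same trained model can match a question in one language against answers in another.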


Let's implement the QA system.

Code Implementation of Question Answer Retrieval

Install and import all dependencies:
! pip install -q nltk
! pip install -q tqdm
! pip install -q simpleneighbors[annoy]
! pip install -q tensorflow_text
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from tqdm.notebook import tqdm
from IPython.display import display, HTML
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')
Load the model:
model_url = ''  # supply the TF Hub URL of the multilingual USE-QA model here
model = hub.load(model_url)
Helper Functions:

The helper functions below download the dataset, extract the questions and answers from it, and visualize the retrieved results.

def squad_download(url):
  return json.load(urllib.request.urlopen(url))

def sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))

  return list(set(all_sentences)) # remove duplicates

def questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))

  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break  # no more occurrences; append the remainder and stop
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)

  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text, answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences:</p>
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))
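To see what the extraction helpers produce, here is a minimal run on a hand-made SQuAD-style record. A naive split on '. ' stands in for nltk's sentence tokenizer so the snippet is self-contained; the logic mirrors `sentences_from_squad_json` and `questions_from_squad_json` above.

```python
# Minimal SQuAD-style record showing what the extraction helpers produce.
# A naive split on '. ' stands in for nltk.tokenize.sent_tokenize here.
squad = {'data': [{'paragraphs': [{
    'context': 'Everest lies in the Himalayas. It borders Nepal and Tibet.',
    'qas': [{'question': 'Where is Everest?',
             'answers': [{'text': 'Nepal and Tibet'}]}],
}]}]}

# Each sentence is paired with the full paragraph it came from, so the
# response encoder can later receive both the answer and its context.
sentences = []
for data in squad['data']:
    for paragraph in data['paragraphs']:
        for sent in paragraph['context'].split('. '):
            sentences.append((sent, paragraph['context']))

questions = [(qas['question'], qas['answers'][0]['text'])
             for data in squad['data']
             for paragraph in data['paragraphs']
             for qas in paragraph['qas'] if qas['answers']]

print(len(sentences))   # one (sentence, context) pair per sentence
print(questions[0])     # ('Where is Everest?', 'Nepal and Tibet')
```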
Load data:

The SQuAD dataset is used for the demo. It is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.

Here we download the data from the official SQuAD page in JSON format; then we extract the questions, the answers and the context of each answer, as shown below:

squad_url = ''  # supply the URL of a SQuAD JSON file here
squad_json = squad_download(squad_url)
sentences_ = sentences_from_squad_json(squad_json)
questions = questions_from_squad_json(squad_json)

print(len(sentences_), 'sentences and', len(questions), 'questions are extracted from SQuAD json', squad_url)
print('\nExample of sentence and context:\n')
sentence = random.choice(sentences_)
pprint.pprint(sentence)
print('\nExample of Q&A:\n')
question = random.choice(questions)
pprint.pprint(question)


Infer the model:

To perform the semantic search, we use SimpleNeighbors, a wrapper around the Annoy approximate nearest-neighbour library. It finds the items in your dataset that are closest to any other item, or to an arbitrary point in the vector space that your data defines.
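The `'angular'` metric passed to SimpleNeighbors below is Annoy's angular distance, sqrt(2 * (1 - cosine)), so ranking by angular distance is equivalent to ranking by cosine similarity. A brute-force sketch of what the index computes, on tiny 2-d vectors:

```python
# Brute-force version of what the 'angular' index computes: Annoy's
# angular distance is sqrt(2 * (1 - cosine similarity)), so the nearest
# neighbour is simply the item with the highest cosine similarity.
import math

def angular_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    cos = dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return math.sqrt(2 * (1 - cos))

items = {'a': [1.0, 0.0], 'b': [0.7, 0.7], 'c': [0.0, 1.0]}
query = [0.9, 0.1]
nearest = sorted(items, key=lambda k: angular_distance(query, items[k]))
print(nearest)  # ['a', 'b', 'c']
```

The real index avoids this exhaustive scan by building random-projection trees, which is why it is only approximate.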

batch_size = 120

encodings = model.signatures['response_encoder'](
    input=tf.constant([sentences_[0][0]]),
    context=tf.constant([sentences_[0][1]]))
index = simpleneighbors.SimpleNeighbors(len(encodings['outputs'][0]),
                                        metric='angular')

print('Computing embeddings for', len(sentences_), 'sentences')
slices = zip(*(iter(sentences_),) * batch_size)
num_batches = int(len(sentences_) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
      input=tf.constant(response_batch),
      context=tf.constant(context_batch))
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for', len(sentences_), 'sentences is built')
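The `zip(*(iter(sentences_),) * batch_size)` line above is a compact batching idiom: zipping several copies of one shared iterator pulls `batch_size` items per step. On a small example it behaves like this; note that any trailing partial batch is silently dropped.

```python
# The batching idiom used above: zip over batch_size copies of a single
# shared iterator yields fixed-size chunks. A trailing partial batch is
# silently dropped (7 items with batch_size 3 -> two batches, item 6 lost).
items = list(range(7))
batch_size = 3
slices = list(zip(*(iter(items),) * batch_size))
print(slices)  # [(0, 1, 2), (3, 4, 5)]
```

If every sentence must be indexed, a final partial batch would need to be handled separately.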
Test the model: 
num_results = 10
query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])



This is all about building a Question Answering system based on the multilingual universal sentence encoder. We have seen how QA systems are structured and what types exist; with such tools, we can build fully interactive NLP applications. Moreover, you can use another dataset available on the SQuAD platform, or your own dataset, and perform the same task.



Copyright Analytics India Magazine Pvt Ltd
